MEANINGFUL CLOUDS: TOWARDS A NOVEL INTERFACE FOR DOCUMENT VISUALIZATION


Dan Watters
DePaul University
Chicago, IL USA
iamdanwatters@yahoo.com

ABSTRACT

This paper explores text clouds as a means of semantically visualizing a document and proposes the development of a tool to extract and display contextual data from text, helping users perceive the meaning of a document at a glance and speeding the task of search result summary evaluation.

KEY WORDS

Tag Cloud, Text Cloud, Folksonomy, Data Visualization, Text Mining, Automatic Summarization, Categorization, Term Extraction

INTRODUCTION

Popular Web 2.0 social bookmarking websites such as Flickr, Delicious, and Connotea apply user-generated keywords, or tags, to their collective content in a flat, non-hierarchical manner known as folksonomy, displaying the most popular tags in a tag cloud visualization that provides additional contextual meaning and findability. The term folksonomy was coined in 2004 by Thomas Vander Wal, combining "folk" with "taxonomy", and represents a bottom-up categorization of terms based on community consensus rather than the top-down hierarchical taxonomy commonly employed by traditional library scientists [23]. A tag cloud is a list of tags arranged visually, with added meaning conveyed through contrasting size, weight, and color of the navigable text labels based on each tag's frequency-of-occurrence within the group collective [15]. Accordingly, the more popular a tag term becomes, the greater the size and weight of the corresponding label displayed in the tag cloud visualization. Tag clouds therefore provide a summary, or semantic view, of the most commonly used collective concepts generated by users for a particular subject or category [20].

Figure 1: A tag cloud showing Flickr's all-time most popular tags demonstrates tag size based on frequency-of-occurrence within the collective group.
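As an illustration, the mapping from a tag's frequency-of-occurrence to its displayed size can be sketched in Python. This is a minimal sketch only: the logarithmic scaling and the pixel bounds are illustrative assumptions, not the implementation used by Flickr or any particular tag cloud.

```python
import math

def tag_sizes(tag_counts, min_px=12, max_px=36):
    """Map each tag's frequency-of-occurrence to a font size in pixels.

    Log-scaling (a common choice, assumed here) keeps very popular tags
    from dwarfing the rest of the cloud.
    """
    lo = math.log(min(tag_counts.values()))
    hi = math.log(max(tag_counts.values()))
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all counts are equal
    return {
        tag: round(min_px + (math.log(n) - lo) / span * (max_px - min_px))
        for tag, n in tag_counts.items()
    }

# Hypothetical tag counts for illustration
sizes = tag_sizes({"wedding": 4017, "party": 1295, "travel": 9872, "cat": 310})
```

Here the most frequent tag ("travel") receives the largest label and the least frequent ("cat") the smallest, mirroring the size-by-popularity behavior described above.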
Although the text cloud is identical in appearance to a tag cloud, it differs in function, and is used primarily as an aid in the analysis and comprehension of bodies of text. Rather than showing user-tagged labels representing collective content, text clouds display an automated representation of the most frequently-occurring keywords within a particular document or corpus, acting as a data structure or "executive summary on steroids" [21]. The visual display of keywords in a text cloud acts much like an unstructured table of contents or outline, allowing users to quickly gain a sense of the document's major themes or meaning [22]. Current uses for text clouds include summarizing abstracts returned from search queries to a biomedical database (PubCloud), and providing visual summaries of documents uploaded from the Web or other sources (TagCrowd) [18][21]. Text clouds illustrate data in a visually appealing and elegant manner. Steinbock states, "When we look at a text cloud, we see not only an informative, beautiful image that communicates much in a single glance; we see a whole new perspective on text" [25].
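The automated frequency extraction behind a text cloud can be sketched as follows. This is a rough approximation for illustration only; TagCrowd's actual stop-list, stemming, and filtering rules are not reproduced here, and the abbreviated stop-word set is an assumption.

```python
import re
from collections import Counter

# Deliberately abbreviated stop-list for illustration
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "this"}

def top_keywords(text, n=20):
    """Return the n most frequently-occurring keywords in a text.

    A minimal TagCrowd-style approximation: lowercase the text, keep only
    word tokens of two or more letters, and drop stop words before counting.
    """
    words = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

keywords = top_keywords(
    "The machine imitates the human; the machine plays the imitation game.", n=3
)
```

The resulting (keyword, frequency) pairs would then drive the size and weight of each label in the cloud.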

Figure 2: The text cloud of Lincoln's Gettysburg Address, created by TagCrowd, aids information retrieval by displaying key terms found within the text.

However, a text cloud utilizing term frequency-of-occurrence as the sole measure of a document's meaning presents a limited approach that could gain from integrating additional text mining and visualization functions to provide richer data capabilities to users seeking to interpret and evaluate the contents of a document [8]. This paper describes a conceptual vision, presents examples, and provides recommendations for development of a prototype which would apply automatic categorization, summarization, and keyphrase extraction functions to a document or text. The results would be displayed in a text cloud, providing users a visual reference for evaluation and aiding the task of information search and retrieval.

LANDSCAPES OF MEANING

Before visiting an unfamiliar location, a common practice is to consult a map for directions and familiarity with the landscape to be traveled. Apply the mental model of a document as unexplored territory, and the prototype to be developed as a roadmap providing context and orientation, and it becomes apparent a document can be viewed as a landscape of meaning. Consider a set of points of interest, of peaks and valleys, each peak representing a focal node within the document, i.e. the most important extracted keyphrases and corresponding content. The topographical depictions of semantic landscapes created by Frid-Jiminez offer further inspiration regarding meaningful visualization of data from a document or text [12]. Google Maps provides another analogy.

The proposed prototype, called CloudMine, would apply the following text mining functions to a body of text and display a visual overview in the form of a text cloud, providing the context, background, and key points needed for users to quickly interpret a document's essential content or meaning. Automatic categorization would provide a contextual frame of reference, such as a particular nation, state, or geographic location. Summarization presents an overview, or in our map analogy the lay of the land, depicting a view of the map. Lastly, extracted keyphrases can be compared to geographic points of interest such as restaurants, gas stations, or a particular address. The goal of CloudMine is to provide users with an at-a-glance interpretation of the essential meaning of a document so as to reduce the cognitive load often associated with information retrieval tasks on the Web.

Figure 3: Conceptual model for CloudMine: a roadmap intended to help users visualize and interpret a document's key elements and meaning [26].

CATEGORIZATION

Automatic categorization assigns documents to an appropriate directory within a taxonomy based on machine processing and evaluation rather than human experts. A taxonomy, or hierarchical classification system, organizes unstructured information into an ordered, logical structure, increasing findability [2]. Some categorization tools (classifiers) are already being used to classify documents for information retrieval within proprietary and academic taxonomies [9]. Algorithms with differing approaches, such as naïve Bayes, k-nearest neighbor (kNN), and support vector machines (SVM), are used to calculate the statistical probability that a document belongs in a particular category [2]. Unfortunately, algorithms directing a document to a category based on mathematical probability or programmatic rules are prone to error because machines do not understand the concepts required for interpretation in the same manner as humans [2].
While human expert catalogers currently remain the standard, automatic categorization is helpful for interpreting vast document repositories or for situations where human resources are not available or economically feasible. Conceptually, the CloudMine prototype would evaluate the content of a body of text and return a high-level category term from an appropriate taxonomy describing similar documents, giving users a sense of context while providing information to be used as metadata for automatic retrieval via keywords or tags [30]. Although CloudMine could be used to visualize text within a relatively constrained proprietary or academic document management system, the intent is for CloudMine to be able to evaluate and categorize documents found on the Web. Consequently, the Web itself can act as a resource for the classification schema.

Yahoo! Directory is an existing web repository of a hierarchical taxonomy that can be used as the basis for the automatic assignment of categories to a document text [24]. Using a representative sample of web documents as a training corpus, Labrou and Finin used Telltale, a classifier based on an n-gram algorithm, to automatically assign documents to Yahoo! Directory categories [19]. According to Cavnar and Trenkle, n-gram frequency is an effective method for categorizing documents in an unstructured and variable environment and performs well [4]. An n-gram is a small sub-set or sequence of n items from a string of text, usually either text characters or words, such that based on a training set of data and a given text sequence x_{i-1}, x_{i-2}, ..., x_{i-n}, the probability P(x_i | x_{i-1}, x_{i-2}, ..., x_{i-n}) of the next character or word is predicted [28]. A set of n-grams can be represented as a histogram which can be used as a profile for comparing or matching documents to a particular category by calculating the measure of the distance (or the number of standard deviations from the mean occurrence), and assigning the document to the category with the smallest corresponding distance [4]. Using Yahoo! Directory as the taxonomy and an n-gram classifier, CloudMine could be developed to evaluate document text and provide appropriate category suggestions.

Figure 4: The n-gram categorization process [4].

Top-level category      Sub-category / Classifier keywords
Arts & Humanities       Photography, History, Literature...
Business & Economy      B2B, Finance, Shopping, Jobs...
Computers & Internet    Hardware, Software, Web, Games...
Education               Colleges, K-12, Distance Learning...
Entertainment           Movies, TV Shows, Music, Humor...
Government              Elections, Military, Law, Taxes...
Health                  Diseases, Drugs, Fitness, Nutrition...
News & Media            Newspapers, Radio, Weather, Blogs...
Recreation & Sports     Sports, Travel, Autos, Outdoors...
Reference               Phone Numbers, Dictionaries, Quotes...
Regional                Countries, Regions, U.S. States...
Science                 Animals, Astronomy, Earth Science...
Social Science          Languages, Archaeology, Psychology...
Society & Culture       Sexuality, Religion, Food & Drink...

Table 1: The Yahoo! Directory structure [31].

While the n-gram classification approach described by Labrou and Finin is promising, the purpose of this paper is primarily to explore the concepts behind the proposed development of CloudMine and to suggest promising methods requiring further research. Consequently, implementation of the Telltale classifier on actual document text was considered to be out of scope. For prototyping purposes, an approximation of the previous categorization process was carried out. Nonetheless, the results, shown in Table 2, are instructive.
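The profile-distance idea behind n-gram categorization can be sketched as follows, using the rank-order "out-of-place" distance from Cavnar and Trenkle. This is a minimal sketch, not the Telltale implementation; the tiny keyword-based category profiles are illustrative assumptions standing in for profiles trained on real document corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Build a rank-ordered character n-gram profile of a text."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams absent from the category profile
    receive a maximum penalty (Cavnar & Trenkle's out-of-place measure)."""
    ranks = {g: r for r, g in enumerate(cat_profile)}
    penalty = len(cat_profile)
    return sum(abs(r - ranks.get(g, penalty)) for r, g in enumerate(doc_profile))

def categorize(text, category_profiles):
    """Assign the category whose n-gram profile is nearest the document's."""
    doc = ngram_profile(text)
    return min(category_profiles, key=lambda c: out_of_place(doc, category_profiles[c]))

# Tiny illustrative "category profiles" built from keyword lists, not real training data
profiles = {
    "Computers & Internet": ngram_profile(
        "hardware software web games computer internet digital machine program data"),
    "Health": ngram_profile(
        "diseases drugs fitness nutrition medicine health doctor patient therapy"),
}
```

A document mentioning digital computers would land nearer the "Computers & Internet" profile than the "Health" profile, which is the behavior the CloudMine categorization step would rely on at a much larger scale.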

Rank   Keyword         Frequency
1      Machine
2      Computer        61
3      Game            36
4      Digital         35
5      Possible        35
6      Argument        29
7      States          26
8      Interrogator    25
9-19   Described, System, Behavior, Discrete-state, Fact, Human, Idea, Rules, Imitation, Store, Capacity
20     Process         17

Table 2: TagCrowd extraction of the 20 most frequently-occurring terms found within Turing's "Computing Machinery & Intelligence".

Using Steinbock's word frequency calculator from TagCrowd, the text of Alan Turing's well-known essay "Computing Machinery & Intelligence" was parsed and the top 20 keywords identified. Next, the keywords or phrases would need to be compared by the n-gram algorithm and matched to the closest occurring category. Human inspection reveals three of the top five terms are "Machine", "Digital", and "Computer", closely identifying the document with the Yahoo! Directory category entitled Computers & Internet. While many documents may not have such clearly identifiable subjects, it seems likely, given a reasonable data corpus, that in many cases an appropriate category or categories could be automatically determined. Although a cursory view of the data returned appears promising, more research and testing are needed to determine the utility and effectiveness of the proposed automatic categorization methods.

SUMMARIZATION

In spite of considerable progress since the early 1960s, current automatic text summarization tools still do not approach human levels of cognition or knowledge abstraction. However, due to the ever-increasing processing power of the personal computer, large quantities of text can be scanned rapidly and inexpensively, efficiently returning summaries typically based not on an abstraction, or reinterpretation of text, but on extraction of the most relevant segments of the original text, organized into a new text form [16]. Although extraction summarization methods cannot provide the same thoughtful analysis as the human mind, they can quickly and easily return meaning from vast quantities of text.
A key component of the conceptual vision of CloudMine is the ability to provide short document summaries. Searching through vast amounts of information on the Web is both time- and labor-intensive. Document summaries are useful as a view or sense of a text's meaning [17]. Additionally, research has shown web pages and full-text documents are difficult to download, browse, and use on wireless, mobile, and handheld devices [3]. CloudMine could conceivably serve as a tool providing quick, easily-viewable document summaries, accessed through the Web via wired or wireless applications. CloudMine conceptually summarizes not only a document, but potentially any document found on the Web: a highly variable repository, as opposed to a limited corpus such as a collection of scholarly works or a corporate intranet database. However, traditional text extraction methods have focused on using algorithms to extract key phrases from text by creating a model of appropriate keywords gleaned from an established training set of documents, and then applying the model to related new documents as a means of assigning a comparative value to found key words or phrases [17]. The variability of documents found on the Web presents a unique challenge to the task of automatic summarization, requiring new methods, currently being researched, that can among other things: extract key phrases from an unrestricted set of documents; operate without the need for training documents; return results quickly; and account for images or other objects found within the document [1]. One promising method for summarizing Web documents, known as sentence-based abstraction, pulls important sentences from document text, forming summaries based on preferred terms such as titles, bold text, and repeated phrases, assembling snippets of information into a comprehensive whole [1].
Although traditional automatic text summarization techniques generally involve comparing a text corpus to a model based on a predefined training set, new research tools are being developed that account for the variable nature of Web documents. SweSum, a free online summarizer originally developed for Swedish text, was chosen to simulate CloudMine's summarization process. SweSum uses a 700,000-word dictionary to identify keywords, extracting key sentences from a document based on scores determined by parameters including term frequency (tf), sentence position, numeric values, and bold or title text, in order to create a condensed summary of the original text [6]. Frequent terms found within the text and matched against the tool's dictionary are considered to be more important, thus earning a higher overall score [6]. Term frequency represents the number of times a term is found within a document divided by the total number of words, and can be calculated for term t_i within document d_j as: tf_{i,j} = n_{i,j} / Σ_k n_{k,j} [29].

Figure 5: The SweSum automatic summarization process [7].

Baseline: Sentence order in the text gives the importance of the sentences; the first sentence receives the highest ranking, the last sentence the lowest.
Title: Words in the title and immediately following sentences are given a higher score.
Term frequency: Open-class terms that are frequent in the text are considered to be more important than less frequent terms.
Position score: The assumption is that certain genres put important sentences in fixed positions. For example, newspaper articles usually have the most important terms in the first four paragraphs; reports, on the other hand, have many important sentences at the end of the text.
Sentence length: The sentence length implies which sentence is the most important.
Average lexical connectivity: The number of terms shared with other sentences. The assumption is that a sentence that shares more terms with other sentences is more important.
Numerical data: Sentences containing numerical data are scored higher than those without numerical values.

All the above parameters are normalized and combined in a simple function with modifiable weighting. The idea is that high-scoring sentences in the original text are kept in the summary.

Table 3: Extracted text summarized by SweSum is weighted by the above parameters [7].

Turing's "Computing Machinery & Intelligence" was uploaded to SweSum's online interface for processing, and the results used as an example of automatic summarization for the CloudMine prototype.
Out of a total document word count of 11,750 words, the shortest allowable (1%) summary extraction yielded the following 123-word summary:

"An interesting variant on the idea of a digital computer is a 'digital computer with a random element.' These have instructions involving the throwing of a die or some equivalent electronic process; one such instruction might for instance be, 'Throw the die and put the resulting number into store 1000.' Sometimes such a machine is described as having free will (though I would not use this phrase myself), It is not normally possible to determine from observing a machine whether it has a random element, for a similar effect can be produced by such devices as making the choices depend on the digits of the decimal for π. This machine could be described abstractly as follows. (It would not be a discrete-state machine however" [Summary extraction by SweSum].

Ideally, an effective summary should communicate the essence of a document as succinctly as possible. SweSum's summary seems longer than necessary and fails to hold together as a cohesive paragraph. Perhaps some of these shortcomings are due to the application having been created for newspaper text rather than academic journals or documents. However, it is interesting to note SweSum's extracted summary manages to capture the phrase, "Sometimes such a machine is described as having free will", an essential element of Turing's essay. Many variables can affect the outcome of the automatic summarization process, and additional testing would help to yield a more descriptive and comprehensive outcome. Although the summarization attempt yielded less than optimal results, a potentially better outcome could be achieved by either altering variables within SweSum or using a different summarization tool.
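The sentence-scoring approach behind extractive summarizers like SweSum can be sketched as follows. This is a simplified sketch, not SweSum's implementation: only four of the Table 3 parameters (baseline position, title words, term frequency, numerical data) are modeled, and the weights are illustrative assumptions.

```python
import re
from collections import Counter

def summarize(text, title="", ratio=0.3, weights=(1.0, 1.0, 2.0, 0.5)):
    """Score sentences by SweSum-like parameters and keep the top scorers,
    returned in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tf = Counter(re.findall(r"\w+", text.lower()))          # term frequencies
    title_words = set(re.findall(r"\w+", title.lower()))
    w_pos, w_title, w_tf, w_num = weights

    def score(i, sent):
        toks = re.findall(r"\w+", sent.lower())
        s = w_pos * (len(sentences) - i) / len(sentences)         # baseline: earlier is better
        s += w_title * sum(1 for t in toks if t in title_words)   # title-word overlap
        s += w_tf * sum(tf[t] for t in toks) / max(len(toks), 1)  # average term frequency
        s += w_num * bool(re.search(r"\d", sent))                 # numerical-data bonus
        return s

    ranked = sorted(range(len(sentences)), key=lambda i: -score(i, sentences[i]))
    keep = sorted(ranked[: max(1, int(len(sentences) * ratio))])
    return " ".join(sentences[i] for i in keep)
```

As in SweSum, the parameters are combined with modifiable weights, and high-scoring sentences from the original text are kept in the summary.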
KEYPHRASE EXTRACTION

Led by Google, efforts to digitize vast quantities of books, journals, and articles into a searchable repository of human knowledge suggest near-limitless quantities of information can be retrieved by a single keystroke. However, scrolling through vast lists of search results does not necessarily equate to a pleasant user experience. The use of keywords as a filter or prism, limiting the spectrum of relevant documentation to manageable proportions, can serve as an important tool in the fight against information overload [13]. Keywords can be used as descriptions of documents for retrieval, as a way of browsing a collection (e.g. Flickr, del.icio.us, etc.), as an entry point into a document, as a visual emphasis of important phrases, and as a means of measuring document similarity [5]. Once accomplished manually by human experts, automatic keyword extraction techniques now select terms indicative of the text's subject matter from within the body of the document, returning a list of the most relevant words or phrases much more rapidly than previous methods [27]. Unfortunately, current extraction methods tend to be brute-force techniques, selecting keywords without the benefit of context or nuance [13]. While not yet optimal, automatic keyword extraction techniques can effectively reduce vast quantities of text to a small number of significant terms or phrases much faster and more easily than their human counterparts.

Automatic keyword extraction methods generally employ a statistical, linguistic, or combined approach to the task of selecting appropriate words or phrases from a document. Well-known examples include KEA, an algorithm extracting statistically significant keyphrases from text based on parameters obtained from a training set of existing documents, and WordNet, a free web-based concordance from Princeton University that employs lexical techniques such as sentence structure and part-of-speech tags to algorithmically predict potential keywords [14]. Common among almost all current keyword extraction techniques is the use of domain-specific word training models and the term frequency-inverse document frequency (TF*IDF) equation as the method of identifying the most significant key terms [14]. TF*IDF ranks words in a document by taking the inverse proportion of the frequency of the word's occurrence within the document to the percentage of word occurrence within a specified document corpus; a high TF*IDF score indicates a probable relevant keyword [29]. The frequency of specific words or phrases found within a document can indicate subject matter or meaning.
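The TF*IDF ranking described above can be sketched as follows. This is a minimal sketch with a tiny illustrative corpus and naive whitespace tokenization; real keyword extractors use larger corpora and more careful preprocessing.

```python
import math

def tf_idf(term, doc, corpus):
    """TF*IDF for one term: high when the term is frequent in this document
    but rare across the corpus.

    tf  = occurrences of term in doc / total words in doc
    idf = log(number of documents / number of documents containing term)
    """
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical three-document corpus for illustration
docs = [
    "the imitation game tests the machine",
    "the garden party was lovely",
    "the machine learns the game",
]
score = tf_idf("imitation", docs[0], docs)
```

A word like "the", which appears in every document, scores zero, while a distinctive term like "imitation" scores highest, which is exactly the behavior that makes TF*IDF useful for identifying significant key terms.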
Unfortunately, the ability of a text cloud to communicate document meaning remains limited by the cloud format itself, causing words to be displayed outside of the context of their own occurrence or existence. Garrett labels the condition of extracted keywords without the benefit of context the Frankenstein Fallacy, arguing, "You pull a beating heart out of a body and put it somewhere else, and indeed, it still is the heart, yet in any meaningful way it is the heart no longer. Further, we are staking much of the future of textual analysis on the results of a relentless, almost instantaneous, but ultimately dumb process performed by machines" [13]. Garrett concludes by quoting Deborah Friedell, in a 2005 New York Times article entitled "The Word Crunchers": "While Amazon's concordance can show us the frequency of the words 'day' and 'shall' in Whitman, 'contain' and 'multitudes' don't make the top 100. Neither does 'be' in Hamlet, nor 'damn' in Gone with the Wind. The force of these words goes undetected by even the most powerful computers" [13]. Rather than employ the same "dumb" (frequency-of-occurrence) method to generate single keywords, an attempt was made to add extra contextual meaning by extracting a document's key multi-word phrases for insertion into the CloudMine text cloud visualization. In theory, keyphrases made of multiple terms could provide additional contextual background lacking in single-term keywords, lessening the effect of Garrett's Frankenstein Fallacy and aiding information search and retrieval tasks. In order to provide multi-word terms for the CloudMine prototype visualization, the Termine Web Demonstrator, a free online automatic term extraction tool, was chosen because it is domain independent, incorporates statistical and linguistic methods and contextual information, and, most importantly, considers multi-word terms as opposed to one-word keywords [11].
Conceptually, CloudMine's text cloud visualization will provide additional meaning to users through the display of keyphrases extracted from the document text. Instead of relying solely on TF*IDF to identify keywords, Termine employs both linguistic and statistical techniques to indicate term significance by calculating a candidate string's level of "termhood" (C-value), or the likelihood a term is significant enough to be considered a keyword. Keyphrases are extracted linguistically by scanning text with a part-of-speech (POS) tagger that determines each word's grammatical value and assigns tags identifying it as a noun, verb, etc. A linguistic filter is then applied, limiting probable keyphrases to only those with acceptable part-of-speech phrase combinations. Lastly, a stop-list excludes word phrases containing common or unsuitable terms, and a list of potential candidate strings is compiled for statistical evaluation [11].

C-value(a) = log2|a| * f(a), if a is not nested

C-value(a) = log2|a| * (f(a) - (1/P(Ta)) * Σ f(b)), if a is nested

(When a is a substring of b, we refer to a as nested and b as a's nesting string.)

a      = candidate string (e.g., "failure")
b      = nesting string (e.g., "heart failure")
|a|    = length (number of words) of a
f(a)   = frequency of a in the corpus
Ta     = set of strings b that contain a
P(Ta)  = number of strings b in Ta
f(b)   = frequency of b in the corpus

Table 4: The C-value algorithm calculates the likelihood of term significance [10].

Termine assigns a value (C-value) to a candidate string, identifying the overall highest-value key terms found within the document, by measuring the string's length and frequency, counting the number of times it occurs as part of longer multi-word terms, and the total number of those multi-word terms. An additional value, the NC-value, can also be applied to the C-value terms, re-ranking them based on a weighted value determined by the frequency of common terms occurring in the context of the contents of the document [11]. Using the C-value to extract high-ranking keyphrases from text adds additional contextual information to the process of automatic term extraction.

Tag the corpus;
extract strings using the linguistic filter;
remove tags from strings;
remove strings below the frequency threshold;
filter the rest of the strings through the stop-list;
for all strings a of maximum length:
    calculate C-value(a) = log2|a| * f(a);
    if C-value(a) >= Threshold, add a to the output list;
    for all substrings b: revise t(b); revise c(b);
for all smaller strings a, in descending order of length:
    if a appears for the first time:
        C-value(a) = log2|a| * f(a);
    else:
        C-value(a) = log2|a| * (f(a) - t(a)/c(a));
    if C-value(a) >= Threshold, add a to the output list;
    for all substrings b: revise t(b); revise c(b);

Table 5: Termine's multi-term extraction process uses the C-value to extract important keyphrases [11].

Turing's "Computing Machinery & Intelligence" text was uploaded to Termine's Web Demonstrator and the highest-ranking keyphrases extracted. Analysis of the returned keyphrases reveals the identification of a number of multi-word terms with specific meaning to Turing and his well-known essay. Meaningful and Turing-specific keyphrases included: "discrete-state machine", "imitation game", "lady lovelace", "manchester machine", and "differential analyser". Turing's test, called the imitation game, was proposed to evaluate whether computers can think, and is a key aspect of his famous essay.
Likewise, the other terms mentioned are also well-known and significant phrases in association with the essay. Additional computer-related and generally meaningful terms identified include: "digital computer", "human computer", and "storage capacity". The ability of Termine to identify and extract related multi-term keyphrases from Turing's paper adds contextual meaning and specificity to particularly relevant phrases found within the text. A comparison of similar terms identified as significant by both C-value and TF*IDF methods demonstrates the expressiveness of multi-word terms and their value in aiding full-text search. Consider the difference in meaning between the multi-word keyphrase "discrete-state machine", extracted by Termine, and "discrete-state", extracted by the TagCrowd website's keyword generator. A Google search on "discrete-state machine" returns three articles on Alan Turing within the top five search results, while the same search on "discrete-state" alone shows no mention of Turing or his paper in the first twenty results. Further Google searches on multi-word keyphrases such as "manchester machine" or "lady lovelace" also returned relevant results related to Turing, while a search using one-word keywords extracted by TagCrowd failed to return further information. Further research is needed to discover the optimal length for extracted multi-word terms and to evaluate the effectiveness of current extraction techniques.

Rank  Term                    Score
1     digital computer        32
2     discrete-state machine  18
3     imitation game          16
4     storage capacity        10
5     human computer          9
6     child machine           6
6     scientific induction    6
8     analytical engine       5
8     logical system          5
8     lady lovelace           5
8     well-established fact   5
12    subject matter          4
12    manchester machine      4
12    differential analyser   4
12    random element          4
12    nervous system          4

Table 6: Termine's top multi-term keyphrases extracted from Turing's "Computing Machinery & Intelligence".
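The C-value scoring of Table 4 can be sketched as follows. This is a minimal sketch of the Frantzi et al. formula only, not Termine itself: the input maps candidate phrases to corpus frequencies, and the part-of-speech tagging, linguistic filtering, and stop-listing are assumed to have happened upstream; the phrase frequencies are illustrative.

```python
import math

def c_values(term_freqs):
    """Compute C-value termhood scores for multi-word candidate terms.

    term_freqs maps candidate phrases (tuples of words) to their corpus
    frequencies. Nested terms have the frequency they inherit from their
    longer nesting strings discounted, per the Table 4 formulas.
    """
    def nests(a, b):
        """True if phrase a occurs as a contiguous sub-phrase of longer phrase b."""
        return len(a) < len(b) and any(
            b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

    scores = {}
    for a, f_a in term_freqs.items():
        nesting = [f_b for b, f_b in term_freqs.items() if nests(a, b)]
        if nesting:  # a is nested: C-value(a) = log2|a| * (f(a) - (1/P(Ta)) * sum f(b))
            scores[a] = math.log2(len(a)) * (f_a - sum(nesting) / len(nesting))
        else:        # a is not nested: C-value(a) = log2|a| * f(a)
            scores[a] = math.log2(len(a)) * f_a
    return scores

# Illustrative candidate phrases and frequencies (not Termine's actual counts)
freqs = {
    ("digital", "computer"): 32,
    ("discrete-state", "machine"): 18,
    ("digital", "computer", "program"): 4,
}
scores = c_values(freqs)
```

Here "digital computer" is nested inside "digital computer program", so its raw frequency is discounted by the frequency of the longer term before scoring, reflecting the intuition that occurrences inside a longer established term should not inflate the shorter term's own termhood.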

Figure 6: A text cloud created by TagCrowd displays the most frequent keywords extracted from Turing's paper [25].

CONCLUSION

This paper has described a proposal for the development of a novel document visualization tool named CloudMine. Employing an array of data-mining techniques, CloudMine would display the results in a text cloud format, giving users a sense of document meaning at a glance and aiding in the task of information search and retrieval. CloudMine provides needed context to the display of information through automatic categorization, summarization, and multi-term extraction methods that give users a virtual roadmap to the landscape of meaning found within documents. A key point, as demonstrated by the results comparing multi-term keyphrases with single-term keywords, is the importance of communicating both context and specificity to the user. While additional study is needed, preliminary results suggest developing CloudMine may be instructive to users, aiding visualization and interpretation of document meaning for rapid understanding.

Figure 7: A multi-term text cloud created by CloudMine shows the most important keyphrases extracted from Turing's "Computing Machinery and Intelligence" [26].

Figure 8: The CloudMine demo results page displays extracted keyphrases in a text cloud, suggested categories, and a summary, aiding document interpretation and understanding.

REFERENCES

[1] Amitay, E. and Paris, C. (2000). Automatically summarising Web sites: is there a way around it? In Proceedings of the Ninth International Conference on Information and Knowledge Management (McLean, Virginia, United States, 2000).

[2] Blumberg, R. and Atre, S. (2003). Automatic Classification: Moving to the Mainstream. DM Review Magazine, April 2003.

[3] Buyukkokten, O., Garcia-Molina, H. and Paepcke, A. Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices. In Proceedings of the Tenth International World-Wide Web Conference.

[4] Cavnar, W. B. and Trenkle, J. M. (1994). N-Gram-Based Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, April 1994.

[5] D'Avanzo, E. and Magnini, B. (2005). A Keyphrase-Based Approach to Summarization: The LAKE System at DUC. In Proceedings of the Document Understanding Workshop, October 9-10, 2005, Vancouver, B.C., Canada.

[6] Dalianis, H. (2000). SweSum: a text summarizer for Swedish. Technical report, October 2000. <ftp://ftp.nada.kth.se/iplab/techreports/iplab-174.pdf>.

[7] de Smedt, K., Liseth, A., Hassel, M. and Dalianis, H. (2005). How short is good? An evaluation of automatic summarization. In Holmboe, H. (ed.) Nordisk Sprogteknologi: Årbog for Nordisk Språkteknologisk Forskningsprogram.

[8] Don, A., Zheleva, E., Machon, G., Tarkan, S., Auvil, L., Clement, T., Shneiderman, B. and Plaisant, C. (2007). Discovering Interesting Usage Patterns in Text Collections: Integrating Text Mining with Visualization. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, Lisbon, Portugal.

[9] Dorre, J., Gerstl, P. and Seiffert, R. (1999). Text Mining: Finding Nuggets in Mountains of Textual Data. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA.

[10] Fahmi, I. (2005). C-value method for multi-word term extraction. Lecture for Seminar in Statistics and Methodology, Alfa-informatica, RuG, May 23, 2005.

[11] Frantzi, K., Ananiadou, S. and Mima, H. (2000). Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, Vol. 3, No. 2 (2000).

[12] Frid-Jimenez, A. "Semantic Landscape." MIT Media Lab. 18 Sept.

[13] Garrett, J. (2006). KWIC and Dirty? Human Cognition and the Claims of Full-Text Searching. Ann Arbor, MI: Scholarly Publishing Office, University of Michigan, University Library, vol. 9, no. 1, Winter 2006.

[14] Giarlo, M. (2006). A Comparative Analysis of Keyword Extraction Techniques. Unpublished paper. 16 Feb.

[15] Hassan-Montero, Y. and Herrero-Solana, V. (2006). Improving Tag-Clouds as Visual Information Retrieval Interfaces. In Proceedings of the 1st International Conference on Multidisciplinary Information Sciences and Technologies (InSCiT2006), Merida, Spain, October 23-28, 2006.

[16] Hassel, M. (2004). Automatic Text Summarization. NADA-IPLab presentation. 16 May.

[17] Jones, S., Lundy, S. and Paynter, G. (2002). Interactive Document Summarization using Automatically Extracted Keyphrases. In Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS-35).

10 [18] Kuo, B., Hentrich, T., Good, B. and Wilkinson, M. (2007). Tag Clouds for Summarizing Web Search Results, in Proc. of the 16th International Conference on the World Wide Web, Banff, Alberta, Canada, < [19] Labrou, Y. and Finin, T. (1999). Yahoo! as an ontology: using Yahoo! categories to describe documents. In Proceedings of the Eighth international Conference on information and Knowledge Management, Kansas City, Missouri, United States, < [20] Lamantia, J. (2006). Tag Clouds: Navigation for Landscapes of Meaning. Joe Lamantia Blog. 16 May < uds_navigation_for_landscapes_of_meaning.html>. [23] Mathes, A. (2004). Folksonomies - Cooperative Classification and Communication through Shared Metadata. Unpublished paper. 16 May < < poster.pdf>. [28] Wikipedia (2007). N-Gram definition. < [29] Wikipedia (2007). TF-IDF definition. < [30] Wu, H., Zubair, M. and Maly, K. (2007). Collaborative Classification of Growing Collections with Evolving Facets, in Proc. of the 18th conference on Hypertext and hypermedia, Manchester, UK, < >. [31] Yahoo! Directory. 14 Apr < [21] Lamantia, J. (2006). Text Clouds: A new form of Tag Cloud? Joe Lamantia Blog. 16 May < xt_clouds_a_new_form_of_tag_cloud.html>. [22] Liu, H., Selker, T. and Lieberman, H. (2003). Visualizing the Affective Structure of a Text Document. In Proc. of the Conference on Human Factors in Computing Systems (CHI 03), Ft. Lauderdale, FL, USA, < g-affective.pdf>. [24] Mladenic, D. (1998). Turning Yahoo into an Automatic Web-Page Classifier, in Proc. of the 13th European Conference on Aritficial Intelligence ECAI'98 (pp ). < [25] Steinbock, D. (2006). TagCrowd: Create your own tag cloud from any text. 22 Nov < [26] Watters, D. (2008). CloudMine: Demonstrating a novel interface for text visualization. Unpublished demonstration. 09 June < ne_demo.htm>. [27] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., and Nevill-Manning, C. G. (1999). KEA: practical automatic keyphrase extraction. 
In Proceedings of the Fourth ACM Conference on Digital Libraries (Berkeley, California, United States, August 11-14, 1999). 10


More information

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt>

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt> Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.

More information

Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

More information

Structured Content: the Key to Agile. Web Experience Management. Introduction

Structured Content: the Key to Agile. Web Experience Management. Introduction Structured Content: the Key to Agile CONTENTS Introduction....................... 1 Structured Content Defined...2 Structured Content is Intelligent...2 Structured Content and Customer Experience...3 Structured

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Automated News Item Categorization

Automated News Item Categorization Automated News Item Categorization Hrvoje Bacan, Igor S. Pandzic* Department of Telecommunications, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia {Hrvoje.Bacan,Igor.Pandzic}@fer.hr

More information

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Survey Results: Requirements and Use Cases for Linguistic Linked Data Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group

More information

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach Banatus Soiraya Faculty of Technology King Mongkut's

More information

Chapter Managing Knowledge in the Digital Firm

Chapter Managing Knowledge in the Digital Firm Chapter Managing Knowledge in the Digital Firm Essay Questions: 1. What is knowledge management? Briefly outline the knowledge management chain. 2. Identify the three major types of knowledge management

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

Inner Classification of Clusters for Online News

Inner Classification of Clusters for Online News Inner Classification of Clusters for Online News Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

More information

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on

More information