MEANINGFUL CLOUDS: TOWARDS A NOVEL INTERFACE FOR DOCUMENT VISUALIZATION


Dan Watters
DePaul University
Chicago, IL USA
iamdanwatters@yahoo.com

ABSTRACT

This paper explores text clouds as a means of semantically visualizing a document and proposes the development of a tool to extract and display contextual data from text, helping users perceive the meaning of a document at a glance and speeding the task of search result summary evaluation.

KEY WORDS

Tag Cloud, Text Cloud, Folksonomy, Data Visualization, Text Mining, Automatic Summarization, Categorization, Term Extraction

INTRODUCTION

Popular Web 2.0 social bookmarking websites such as Flickr, Delicious, and Connotea apply user-generated keywords, or tags, to their collective content in a flat, non-hierarchical manner known as folksonomy, displaying the most popular tags in a tag cloud visualization that provides additional contextual meaning and findability. The term folksonomy was coined in 2004 by Thomas Vander Wal, combining "folk" with "taxonomy", and represents a bottom-up categorization of terms based on community consensus rather than the top-down hierarchical taxonomy commonly employed by traditional library scientists [23]. A tag cloud is a list of tags arranged visually, with added meaning conveyed through contrasting size, weight, and color of the navigable text labels based on each tag's frequency-of-occurrence within the group collective [15]. Accordingly, the more popular a tag term becomes, the greater the size and weight of the corresponding label displayed in the tag cloud visualization. Tag clouds therefore provide a summary, or semantic view, of the most commonly used collective concepts generated by users for a particular subject or category [20].

Figure 1: A tag cloud showing Flickr's all-time most popular tags demonstrates tag size based on frequency-of-occurrence within the collective group.
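As an illustration, the mapping from a tag's frequency-of-occurrence to its displayed size can be sketched in Python. This is a minimal sketch only: the logarithmic scaling and the pixel bounds are illustrative assumptions, not the implementation used by Flickr or any particular tag cloud.

```python
import math

def tag_sizes(tag_counts, min_px=12, max_px=36):
    """Map each tag's frequency-of-occurrence to a font size in pixels.

    Log-scaling (a common choice, assumed here) keeps very popular tags
    from dwarfing the rest of the cloud.
    """
    lo = math.log(min(tag_counts.values()))
    hi = math.log(max(tag_counts.values()))
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all counts are equal
    return {
        tag: round(min_px + (math.log(n) - lo) / span * (max_px - min_px))
        for tag, n in tag_counts.items()
    }

# Hypothetical tag counts for illustration
sizes = tag_sizes({"wedding": 4017, "party": 1295, "travel": 9872, "cat": 310})
```

Here the most frequent tag ("travel") receives the largest label and the least frequent ("cat") the smallest, mirroring the size-by-popularity behavior described above.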
Although the text cloud is identical in appearance to a tag cloud, it differs in function, and is used primarily as an aid in the analysis and comprehension of bodies of text. Rather than showing user-tagged labels representing collective content, text clouds display an automated representation of the most frequently-occurring keywords within a particular document or corpus, acting as a data structure or "executive summary on steroids" [21]. The visual display of keywords in a text cloud acts much like an unstructured table of contents or outline, allowing users to quickly gain a sense of the document's major themes or meaning [22]. Current uses for text clouds include summarizing abstracts returned from search queries to a biomedical database (PubCloud), and providing visual summaries of documents uploaded from the Web or other sources (TagCrowd) [18][21]. Text clouds illustrate data in a visually appealing and elegant manner. Steinbock states, "When we look at a text cloud, we see not only an informative, beautiful image that communicates much in a single glance; we see a whole new perspective on text" [25].
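The automated frequency extraction behind a text cloud can be sketched as follows. This is a rough approximation for illustration only; TagCrowd's actual stop-list, stemming, and filtering rules are not reproduced here, and the abbreviated stop-word set is an assumption.

```python
import re
from collections import Counter

# Deliberately abbreviated stop-list for illustration
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "this"}

def top_keywords(text, n=20):
    """Return the n most frequently-occurring keywords in a text.

    A minimal TagCrowd-style approximation: lowercase the text, keep only
    word tokens of two or more letters, and drop stop words before counting.
    """
    words = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

keywords = top_keywords(
    "The machine imitates the human; the machine plays the imitation game.", n=3
)
```

The resulting (keyword, frequency) pairs would then drive the size and weight of each label in the cloud.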

Figure 2: The text cloud of Lincoln's Gettysburg Address, created by TagCrowd, aids information retrieval by displaying key terms found within the text.

However, a text cloud utilizing term frequency-of-occurrence as the sole measure of a document's meaning presents a limited approach that could gain from integrating additional text mining and visualization functions to provide richer data capabilities to users seeking to interpret and evaluate the contents of a document [8]. This paper describes a conceptual vision, presents examples, and provides recommendations for development of a prototype which would apply automatic categorization, summarization, and keyphrase extraction functions to a document or text. The results would be displayed in a text cloud, providing users a visual reference for evaluation and aiding the task of information search and retrieval.

LANDSCAPES OF MEANING

Before visiting an unfamiliar location, a common practice is to consult a map for directions and familiarity with the landscape to be traveled. Apply the mental model of a document as unexplored territory, and the prototype to be developed as a roadmap providing context and orientation, and it becomes apparent a document can be viewed as a landscape of meaning. Consider a set of points of interest, of peaks and valleys, each peak representing a focal node within the document, i.e. the most important extracted keyphrases and corresponding content. The topographical depictions of semantic landscapes created by Frid-Jiminez offer further inspiration regarding meaningful visualization of data from a document or text [12]. Google Maps provides another analogy.

The proposed prototype, called CloudMine, would apply the following text mining functions to a body of text and display a visual overview in the form of a text cloud, providing the context, background, and key points needed for users to quickly interpret a document's essential content or meaning. Automatic categorization would provide a contextual frame of reference, such as a particular nation, state, or geographic location. Summarization presents an overview, or in our map analogy the lay of the land, depicting a view of the map. Lastly, extracted keyphrases can be compared to geographic points of interest such as restaurants, gas stations, or a particular address. The goal of CloudMine is to provide users with an at-a-glance interpretation of the essential meaning of a document so as to reduce the cognitive load often associated with information retrieval tasks on the Web.

Figure 3: Conceptual model for CloudMine: a roadmap intended to help users visualize and interpret a document's key elements and meaning [26].

CATEGORIZATION

Automatic categorization assigns documents to an appropriate directory within a taxonomy based on machine processing and evaluation rather than human experts. A taxonomy, or hierarchical classification system, organizes unstructured information into an ordered, logical structure, increasing findability [2]. Some categorization tools (classifiers) are already being used to classify documents for information retrieval within proprietary and academic taxonomies [9]. Algorithms with differing approaches, such as naïve Bayes, k-nearest neighbor (kNN), and support vector machines (SVM), are used to calculate the statistical probability that a document belongs in a particular category [2]. Unfortunately, algorithms directing a document to a category based on mathematical probability or programmatic rules are prone to error because machines do not understand the concepts required for interpretation in the same manner as humans [2].
While human expert catalogers currently remain the standard, automatic categorization is helpful for interpreting vast document repositories or for situations where human resources are not available or economically feasible. Conceptually, the CloudMine prototype would evaluate the content of a body of text and return a high-level category term from an appropriate taxonomy describing similar documents, giving users a sense of context while providing information to be used as metadata for automatic retrieval via keywords or tags [30]. Although CloudMine could be used to visualize text within a relatively constrained proprietary or academic document management system, the intent is for CloudMine to be able to evaluate and categorize documents found on the Web. Consequently, the Web itself can act as a resource for the classification schema.

Yahoo! Directory is an existing web repository of a hierarchical taxonomy that can be used as the basis for the automatic assignment of categories to a document text [24]. Using a representative sample of web documents as a training corpus, Labrou and Finin used Telltale, a classifier based on an n-gram algorithm, to automatically assign documents to Yahoo! Directory categories [19]. According to Cavnar and Trenkle, n-gram frequency is an effective method for categorizing documents in an unstructured and variable environment and performs well [4]. An n-gram is a small sub-set or sequence of n items from a string of text, usually either text characters or words, such that based on a training set of data and a given text sequence x_{i-1}, x_{i-2}, ..., x_{i-n}, the probability P(x_i | x_{i-1}, x_{i-2}, ..., x_{i-n}) of the next character or word is predicted [28]. A set of n-grams can be represented as a histogram which can be used as a profile for comparing or matching documents to a particular category by calculating the measure of the distance (or the number of standard deviations from the mean occurrence), and assigning the document to the category with the smallest corresponding distance [4]. Using Yahoo! Directory as the taxonomy and an n-gram classifier, CloudMine could be developed to evaluate document text and provide appropriate category suggestions.

Figure 4: The n-gram categorization process [4].

Top-level category      Sub-category / Classifier keywords
Arts & Humanities       Photography, History, Literature...
Business & Economy      B2B, Finance, Shopping, Jobs...
Computers & Internet    Hardware, Software, Web, Games...
Education               Colleges, K-12, Distance Learning...
Entertainment           Movies, TV Shows, Music, Humor...
Government              Elections, Military, Law, Taxes...
Health                  Diseases, Drugs, Fitness, Nutrition...
News & Media            Newspapers, Radio, Weather, Blogs...
Recreation & Sports     Sports, Travel, Autos, Outdoors...
Reference               Phone Numbers, Dictionaries, Quotes...
Regional                Countries, Regions, U.S. States...
Science                 Animals, Astronomy, Earth Science...
Social Science          Languages, Archaeology, Psychology...
Society & Culture       Sexuality, Religion, Food & Drink...

Table 1: The Yahoo! Directory structure [31].

While the n-gram classification approach described by Labrou and Finin is promising, the purpose of this paper is primarily to explore the concepts behind the proposed development of CloudMine and to suggest promising methods requiring further research. Consequently, implementation of the Telltale classifier on actual document text was considered to be out of scope. For prototyping purposes, an approximation of the previous categorization process was carried out. Nonetheless, the results, shown in Table 2, are instructive.
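The profile-distance idea behind n-gram categorization can be sketched as follows, using the rank-order "out-of-place" distance from Cavnar and Trenkle. This is a minimal sketch, not the Telltale implementation; the tiny keyword-based category profiles are illustrative assumptions standing in for profiles trained on real document corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Build a rank-ordered character n-gram profile of a text."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences; n-grams absent from the category profile
    receive a maximum penalty (Cavnar & Trenkle's out-of-place measure)."""
    ranks = {g: r for r, g in enumerate(cat_profile)}
    penalty = len(cat_profile)
    return sum(abs(r - ranks.get(g, penalty)) for r, g in enumerate(doc_profile))

def categorize(text, category_profiles):
    """Assign the category whose n-gram profile is nearest the document's."""
    doc = ngram_profile(text)
    return min(category_profiles, key=lambda c: out_of_place(doc, category_profiles[c]))

# Tiny illustrative "category profiles" built from keyword lists, not real training data
profiles = {
    "Computers & Internet": ngram_profile(
        "hardware software web games computer internet digital machine program data"),
    "Health": ngram_profile(
        "diseases drugs fitness nutrition medicine health doctor patient therapy"),
}
```

A document mentioning digital computers would land nearer the "Computers & Internet" profile than the "Health" profile, which is the behavior the CloudMine categorization step would rely on at a much larger scale.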

Rank   Keyword         Frequency
1      Machine
2      Computer        61
3      Game            36
4      Digital         35
5      Possible        35
6      Argument        29
7      States          26
8      Interrogator    25
9-19   Described, System, Behavior, Discrete-state, Fact, Human, Idea, Rules, Imitation, Store, Capacity
20     Process         17

Table 2: TagCrowd extraction of the 20 most frequently-occurring terms found within Turing's "Computing Machinery & Intelligence".

Using Steinbock's word frequency calculator from TagCrowd, the text of Alan Turing's well-known essay "Computing Machinery & Intelligence" was parsed and the top 20 keywords identified. Next, the keywords or phrases would need to be compared by the n-gram algorithm and matched to the closest occurring category. Human inspection reveals three of the top five terms are "Machine", "Digital", and "Computer", closely identifying the document with the Yahoo! Directory category entitled Computers & Internet. While many documents may not have such clearly identifiable subjects, it seems likely, given a reasonable data corpus, that in many cases an appropriate category or categories could be automatically determined. Although a cursory view of the data returned appears promising, more research and testing are needed to determine the utility and effectiveness of the proposed automatic categorization methods.

SUMMARIZATION

In spite of considerable progress since the early 1960s, current automatic text summarization tools still do not approach human levels of cognition or knowledge abstraction. However, due to the ever-increasing processing power of the personal computer, large quantities of text can be scanned rapidly and inexpensively, efficiently returning summaries typically based not on an abstraction, or reinterpretation of text, but on extraction of the most relevant segments of the original text, organized into a new text form [16]. Although extraction summarization methods cannot provide the same thoughtful analysis as the human mind, they can quickly and easily return meaning from vast quantities of text.
A key component of the conceptual vision of CloudMine is the ability to provide short document summaries. Searching through vast amounts of information on the Web is both time- and labor-intensive. Document summaries are useful as a view or sense of a text's meaning [17]. Additionally, research has shown web pages and full-text documents are difficult to download, browse, and use on wireless, mobile, and handheld devices [3]. CloudMine could conceivably serve as a tool providing quick, easily-viewable document summaries, accessed through the Web via wired or wireless applications. CloudMine conceptually summarizes not only a document, but potentially any document found on the Web: a highly variable repository, as opposed to a limited corpus such as a collection of scholarly works or a corporate intranet database. However, traditional text extraction methods have focused on using algorithms to extract key phrases from text by creating a model of appropriate keywords gleaned from an established training set of documents, and then applying the model to related new documents as a means of assigning a comparative value to found key words or phrases [17]. The variability of documents found on the Web presents a unique challenge to the task of automatic summarization, requiring new methods, currently being researched, that can among other things: extract key phrases from an unrestricted set of documents; operate without the need for training documents; return results quickly; and account for images or other objects found within the document [1]. One promising method for summarizing Web documents, known as sentence-based abstraction, pulls important sentences from document text, forming summaries based on preferred terms such as titles, bold text, and repeated phrases, assembling snippets of information into a comprehensive whole [1].
Although traditional automatic text summarization techniques generally involve comparing a text corpus to a model based on a predefined training set, new research tools are being developed that account for the variable nature of Web documents. SweSum, a free online summarizer originally developed for Swedish text, was chosen to simulate CloudMine's summarization process. SweSum uses a 700,000-word dictionary to identify keywords, extracting key sentences from a document based on scores determined by parameters including term frequency (tf), sentence position, numeric values, and bold or title text, in order to create a condensed summary of the original text [6]. Frequent terms found within the text and matched against the tool's dictionary are considered to be more important, thus earning a higher overall score [6]. Term frequency represents the number of times a term is found within a document divided by the total number of words, and can be calculated for term t_i within document d_j as: tf_{i,j} = n_{i,j} / Σ_k n_{k,j} [29].

Figure 5: The SweSum automatic summarization process [7].

Baseline: Sentence order in the text gives the importance of the sentences; the first sentence receives the highest ranking, the last sentence the lowest.
Title: Words in the title and immediately following sentences are given a higher score.
Term frequency: Open-class terms that are frequent in the text are considered to be more important than less frequent terms.
Position score: The assumption is that certain genres put important sentences in fixed positions. For example, newspaper articles usually have the most important terms in the first four paragraphs; reports, on the other hand, have many important sentences at the end of the text.
Sentence length: The sentence length implies which sentence is the most important.
Average lexical connectivity: The number of terms shared with other sentences. The assumption is that a sentence that shares more terms with other sentences is more important.
Numerical data: Sentences containing numerical data are scored higher than those without numerical values.

All the above parameters are normalized and combined in a simple function with modifiable weighting. The idea is that high-scoring sentences in the original text are kept in the summary.

Table 3: Extracted text summarized by SweSum is weighted by the above parameters [7].

Turing's "Computing Machinery & Intelligence" was uploaded to SweSum's online interface for processing, and the results used as an example of automatic summarization for the CloudMine prototype.
Out of a total document word count of 11,750 words, the shortest allowable (1%) summary extraction yielded the following 123-word summary:

"An interesting variant on the idea of a digital computer is a 'digital computer with a random element.' These have instructions involving the throwing of a die or some equivalent electronic process; one such instruction might for instance be, 'Throw the die and put the resulting number into store 1000.' Sometimes such a machine is described as having free will (though I would not use this phrase myself), It is not normally possible to determine from observing a machine whether it has a random element, for a similar effect can be produced by such devices as making the choices depend on the digits of the decimal for π. This machine could be described abstractly as follows. (It would not be a discrete-state machine however" [Summary extraction by SweSum].

Ideally, an effective summary should communicate the essence of a document as succinctly as possible. SweSum's summary seems longer than necessary and fails to hold together as a cohesive paragraph. Perhaps some of these shortcomings are due to the application having been created for newspaper text rather than academic journals or documents. However, it is interesting to note SweSum's extracted summary manages to capture the phrase, "Sometimes such a machine is described as having free will", an essential element of Turing's essay. Many variables can affect the outcome of the automatic summarization process, and additional testing would help to yield a more descriptive and comprehensive outcome. Although the summarization attempt yielded less than optimal results, a potentially better outcome could be achieved by either altering variables within SweSum or using a different summarization tool.
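The sentence-scoring approach behind extractive summarizers like SweSum can be sketched as follows. This is a simplified sketch, not SweSum's implementation: only four of the Table 3 parameters (baseline position, title words, term frequency, numerical data) are modeled, and the weights are illustrative assumptions.

```python
import re
from collections import Counter

def summarize(text, title="", ratio=0.3, weights=(1.0, 1.0, 2.0, 0.5)):
    """Score sentences by SweSum-like parameters and keep the top scorers,
    returned in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tf = Counter(re.findall(r"\w+", text.lower()))          # term frequencies
    title_words = set(re.findall(r"\w+", title.lower()))
    w_pos, w_title, w_tf, w_num = weights

    def score(i, sent):
        toks = re.findall(r"\w+", sent.lower())
        s = w_pos * (len(sentences) - i) / len(sentences)         # baseline: earlier is better
        s += w_title * sum(1 for t in toks if t in title_words)   # title-word overlap
        s += w_tf * sum(tf[t] for t in toks) / max(len(toks), 1)  # average term frequency
        s += w_num * bool(re.search(r"\d", sent))                 # numerical-data bonus
        return s

    ranked = sorted(range(len(sentences)), key=lambda i: -score(i, sentences[i]))
    keep = sorted(ranked[: max(1, int(len(sentences) * ratio))])
    return " ".join(sentences[i] for i in keep)
```

As in SweSum, the parameters are combined with modifiable weights, and high-scoring sentences from the original text are kept in the summary.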
KEYPHRASE EXTRACTION

Led by Google, efforts to digitize vast quantities of books, journals, and articles into a searchable repository of human knowledge suggest near-limitless quantities of information can be retrieved by a single keystroke. However, scrolling through vast lists of search results does not necessarily equate to a pleasant user experience. The use of keywords as a filter or prism, limiting the spectrum of relevant documentation to manageable proportions, can serve as an important tool in the fight against information overload [13]. Keywords can be used as descriptions of documents for retrieval, as a way of browsing a collection (e.g. Flickr, del.icio.us, etc.), as an entry point into a document, as a visual emphasis of important phrases, and as a means of measuring document similarity [5]. Once accomplished manually by human experts, automatic keyword extraction techniques now select terms indicative of the text's subject matter from within the body of the document, returning a list of the most relevant words or phrases much more rapidly than previous methods [27]. Unfortunately, current extraction methods tend to be brute-force techniques, selecting keywords without the benefit of context or nuance [13]. While not yet optimal, automatic keyword extraction techniques can effectively reduce vast quantities of text to a small number of significant terms or phrases much faster and more easily than their human counterparts.

Automatic keyword extraction methods generally employ a statistical, linguistic, or combined approach to the task of selecting appropriate words or phrases from a document. Well-known examples include KEA, an algorithm extracting statistically significant keyphrases from text based on parameters obtained from a training set of existing documents, and WordNet, a free web-based concordance from Princeton University that employs lexical techniques such as sentence structure and part-of-speech tags to algorithmically predict potential keywords [14]. Common among almost all current keyword extraction techniques is the use of domain-specific word training models and the term frequency-inverse document frequency (TF*IDF) equation as the method of identifying the most significant key terms [14]. TF*IDF ranks words in a document by taking the inverse proportion of the frequency of the word's occurrence within the document to the percentage of word occurrence within a specified document corpus; a high TF*IDF score indicates a probable relevant keyword [29]. The frequency of specific words or phrases found within a document can indicate subject matter or meaning.
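The TF*IDF ranking described above can be sketched as follows. This is a minimal sketch with a tiny illustrative corpus and naive whitespace tokenization; real keyword extractors use larger corpora and more careful preprocessing.

```python
import math

def tf_idf(term, doc, corpus):
    """TF*IDF for one term: high when the term is frequent in this document
    but rare across the corpus.

    tf  = occurrences of term in doc / total words in doc
    idf = log(number of documents / number of documents containing term)
    """
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical three-document corpus for illustration
docs = [
    "the imitation game tests the machine",
    "the garden party was lovely",
    "the machine learns the game",
]
score = tf_idf("imitation", docs[0], docs)
```

A word like "the", which appears in every document, scores zero, while a distinctive term like "imitation" scores highest, which is exactly the behavior that makes TF*IDF useful for identifying significant key terms.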
Unfortunately, the ability of a text cloud to communicate document meaning remains limited by the cloud format itself, causing words to be displayed outside of the context of their own occurrence or existence. Garrett labels the condition of extracted keywords without the benefit of context the Frankenstein Fallacy, arguing, "You pull a beating heart out of a body and put it somewhere else, and indeed, it still is the heart, yet in any meaningful way it is the heart no longer. Further, we are staking much of the future of textual analysis on the results of a relentless, almost instantaneous, but ultimately dumb process performed by machines" [13]. Garrett concludes by quoting Deborah Friedell, in a 2005 New York Times article entitled "The Word Crunchers": "While Amazon's concordance can show us the frequency of the words 'day' and 'shall' in Whitman, 'contain' and 'multitudes' don't make the top 100. Neither does 'be' in Hamlet, nor 'damn' in Gone with the Wind. The force of these words goes undetected by even the most powerful computers" [13]. Rather than employ the same "dumb" (frequency-of-occurrence) method to generate single keywords, an attempt was made to add extra contextual meaning by extracting a document's key multi-word phrases for insertion into the CloudMine text cloud visualization. In theory, keyphrases made of multiple terms could provide additional contextual background lacking in single-term keywords, lessening the effect of Garrett's Frankenstein Fallacy and aiding information search and retrieval tasks. In order to provide multi-word terms for the CloudMine prototype visualization, the Termine Web Demonstrator, a free online automatic term extraction tool, was chosen because it is domain independent, incorporates statistical and linguistic methods and contextual information, and, most importantly, considers multi-word terms as opposed to one-word keywords [11].
Conceptually, CloudMine's text cloud visualization will provide additional meaning to users through the display of keyphrases extracted from the document text. Instead of relying solely on TF*IDF to identify keywords, Termine employs both linguistic and statistical techniques to indicate term significance by calculating a candidate string's level of "termhood" (C-value), or the likelihood a term is significant enough to be considered a keyword. Keyphrases are extracted linguistically by scanning text with a part-of-speech (POS) tagger that determines each word's grammatical value and assigns tags identifying it as a noun, verb, etc. A linguistic filter is then applied, limiting probable keyphrases to only those with acceptable part-of-speech phrase combinations. Lastly, a stop-list excludes word phrases containing common or unsuitable terms, and a list of potential candidate strings is compiled for statistical evaluation [11].

C-value(a) = log2|a| * f(a), if a is not nested

C-value(a) = log2|a| * (f(a) - (1/P(Ta)) * Σ f(b)), if a is nested

(When a is a substring of b, we refer to a as nested and b as a's nesting string.)

a      = candidate string (e.g., "failure")
b      = nesting string (e.g., "heart failure")
|a|    = length (number of words) of a
f(a)   = frequency of a in the corpus
Ta     = set of strings b that contain a
P(Ta)  = number of strings b in Ta
f(b)   = frequency of b in the corpus

Table 4: The C-value algorithm calculates the likelihood of term significance [10].

Termine assigns a value (C-value) to a candidate string, identifying the overall highest-value key terms found within the document, by measuring the string's length and frequency, counting the number of times it occurs as part of longer multi-word terms, and the total number of those multi-word terms. An additional value, the NC-value, can also be applied to the C-value terms, re-ranking them based on a weighted value determined by the frequency of common terms occurring in the context of the contents of the document [11]. Using the C-value to extract high-ranking keyphrases from text adds additional contextual information to the process of automatic term extraction.

Tag the corpus;
extract strings using the linguistic filter;
remove tags from strings;
remove strings below the frequency threshold;
filter the rest of the strings through the stop-list;
for all strings a of maximum length:
    calculate C-value(a) = log2|a| * f(a);
    if C-value(a) >= Threshold, add a to the output list;
    for all substrings b: revise t(b); revise c(b);
for all smaller strings a, in descending order of length:
    if a appears for the first time:
        C-value(a) = log2|a| * f(a);
    else:
        C-value(a) = log2|a| * (f(a) - t(a)/c(a));
    if C-value(a) >= Threshold, add a to the output list;
    for all substrings b: revise t(b); revise c(b);

Table 5: Termine's multi-term extraction process uses the C-value to extract important keyphrases [11].

Turing's "Computing Machinery & Intelligence" text was uploaded to Termine's Web Demonstrator and the highest-ranking keyphrases extracted. Analysis of the returned keyphrases reveals the identification of a number of multi-word terms with specific meaning to Turing and his well-known essay. Meaningful and Turing-specific keyphrases included: "discrete-state machine", "imitation game", "lady lovelace", "manchester machine", and "differential analyser". Turing's test, called the imitation game, was proposed to evaluate whether computers can think, and is a key aspect of his famous essay.
Likewise, the other terms mentioned are also well-known and significant phrases in association with the essay. Additional computer-related and generally meaningful terms identified include: "digital computer", "human computer", and "storage capacity". The ability of Termine to identify and extract related multi-term keyphrases from Turing's paper adds contextual meaning and specificity to particularly relevant phrases found within the text. A comparison of similar terms identified as significant by both C-value and TF*IDF methods demonstrates the expressiveness of multi-word terms and their value in aiding full-text search. Consider the difference in meaning between the multi-word keyphrase "discrete-state machine", extracted by Termine, and "discrete-state", extracted by the TagCrowd website's keyword generator. A Google search on "discrete-state machine" returns three articles on Alan Turing within the top five search results, while the same search on "discrete-state" alone shows no mention of Turing or his paper in the first twenty results. Further Google searches on multi-word keyphrases such as "manchester machine" or "lady lovelace" also returned relevant results related to Turing, while a search using one-word keywords extracted by TagCrowd failed to return further information. Further research is needed to discover the optimal length for extracted multi-word terms and to evaluate the effectiveness of current extraction techniques.

Rank  Term                    Score
1     digital computer        32
2     discrete-state machine  18
3     imitation game          16
4     storage capacity        10
5     human computer          9
6     child machine           6
6     scientific induction    6
8     analytical engine       5
8     logical system          5
8     lady lovelace           5
8     well-established fact   5
12    subject matter          4
12    manchester machine      4
12    differential analyser   4
12    random element          4
12    nervous system          4

Table 6: Termine's top multi-term keyphrases extracted from Turing's "Computing Machinery & Intelligence".
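The C-value scoring of Table 4 can be sketched as follows. This is a minimal sketch of the Frantzi et al. formula only, not Termine itself: the input maps candidate phrases to corpus frequencies, and the part-of-speech tagging, linguistic filtering, and stop-listing are assumed to have happened upstream; the phrase frequencies are illustrative.

```python
import math

def c_values(term_freqs):
    """Compute C-value termhood scores for multi-word candidate terms.

    term_freqs maps candidate phrases (tuples of words) to their corpus
    frequencies. Nested terms have the frequency they inherit from their
    longer nesting strings discounted, per the Table 4 formulas.
    """
    def nests(a, b):
        """True if phrase a occurs as a contiguous sub-phrase of longer phrase b."""
        return len(a) < len(b) and any(
            b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

    scores = {}
    for a, f_a in term_freqs.items():
        nesting = [f_b for b, f_b in term_freqs.items() if nests(a, b)]
        if nesting:  # a is nested: C-value(a) = log2|a| * (f(a) - (1/P(Ta)) * sum f(b))
            scores[a] = math.log2(len(a)) * (f_a - sum(nesting) / len(nesting))
        else:        # a is not nested: C-value(a) = log2|a| * f(a)
            scores[a] = math.log2(len(a)) * f_a
    return scores

# Illustrative candidate phrases and frequencies (not Termine's actual counts)
freqs = {
    ("digital", "computer"): 32,
    ("discrete-state", "machine"): 18,
    ("digital", "computer", "program"): 4,
}
scores = c_values(freqs)
```

Here "digital computer" is nested inside "digital computer program", so its raw frequency is discounted by the frequency of the longer term before scoring, reflecting the intuition that occurrences inside a longer established term should not inflate the shorter term's own termhood.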

Figure 6: A text cloud created by TagCrowd displays the most frequent keywords extracted from Turing's paper [25].

CONCLUSION

This paper has described a proposal for the development of a novel document visualization tool named CloudMine. Employing an array of data-mining techniques, CloudMine would display the results in a text cloud format, giving users a sense of document meaning at a glance and aiding in the task of information search and retrieval. CloudMine provides needed context to the display of information through automatic categorization, summarization, and multi-term extraction methods that give users a virtual roadmap to the landscape of meaning found within documents. A key point, as demonstrated by the results comparing multi-term keyphrases with single-term keywords, is the importance of communicating both context and specificity to the user. While additional study is needed, preliminary results suggest developing CloudMine may be instructive to users, aiding visualization and interpretation of document meaning for rapid understanding.

Figure 7: A multi-term text cloud created by CloudMine shows the most important keyphrases extracted from Turing's "Computing Machinery and Intelligence" [26].

Figure 8: The CloudMine demo results page displays extracted keyphrases in a text cloud, suggested categories, and a summary, aiding document interpretation and understanding.

REFERENCES

[1] Amitay, E. and Paris, C. (2000). Automatically summarising Web sites: is there a way around it? In Proceedings of the Ninth International Conference on Information and Knowledge Management (McLean, Virginia, United States, 2000).

[2] Blumberg, R. and Atre, S. (2003). Automatic Classification: Moving to the Mainstream. DM Review Magazine, April 2003.

[3] Buyukkokten, O., Garcia-Molina, H. and Paepcke, A. Seeing the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices. In Proceedings of the Tenth International World-Wide Web Conference.

[4] Cavnar, W. B. and Trenkle, J. M. (1994). N-Gram-Based Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, April 1994.

[5] D'Avanzo, E. and Magnini, B. (2005). A Keyphrase-Based Approach to Summarization: The LAKE System at DUC. In Proceedings of the Document Understanding Workshop, October 9-10, 2005, Vancouver, B.C., Canada.

[6] Dalianis, H. (2000). SweSum: a text summarizer for Swedish. Technical report, October 2000. <ftp://ftp.nada.kth.se/iplab/techreports/iplab-174.pdf>.

[7] de Smedt, K., Liseth, A., Hassel, M. and Dalianis, H. (2005). How short is good? An evaluation of automatic summarization. In Holmboe, H. (ed.) Nordisk Sprogteknologi: Årbog for Nordisk Språkteknologisk Forskningsprogram.

[8] Don, A., Zheleva, E., Machon, G., Tarkan, S., Auvil, L., Clement, T., Shneiderman, B. and Plaisant, C. (2007). Discovering Interesting Usage Patterns in Text Collections: Integrating Text Mining with Visualization. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, Lisbon, Portugal.

[9] Dorre, J., Gerstl, P. and Seiffert, R. (1999). Text Mining: Finding Nuggets in Mountains of Textual Data. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA.

[10] Fahmi, I. (2005). C-value method for multi-word term extraction. Lecture for Seminar in Statistics and Methodology, Alfa-informatica, RuG, May 23, 2005.

[11] Frantzi, K., Ananiadou, S. and Mima, H. (2000). Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, Vol. 3, No. 2 (2000).

[12] Frid-Jimenez, A. "Semantic Landscape." MIT Media Lab. 18 Sept.

[13] Garrett, J. (2006). KWIC and Dirty? Human Cognition and the Claims of Full-Text Searching. Ann Arbor, MI: Scholarly Publishing Office, University of Michigan, University Library, vol. 9, no. 1, Winter 2006.

[14] Giarlo, M. (2006). A Comparative Analysis of Keyword Extraction Techniques. Unpublished paper. 16 Feb.

[15] Hassan-Montero, Y. and Herrero-Solana, V. (2006). Improving Tag-Clouds as Visual Information Retrieval Interfaces. In Proceedings of the 1st International Conference on Multidisciplinary Information Sciences and Technologies (InSCiT2006), Merida, Spain, October 23-28, 2006.

[16] Hassel, M. (2004). Automatic Text Summarization. NADA-IPLab presentation. 16 May.

[17] Jones, S., Lundy, S. and Paynter, G. (2002). Interactive Document Summarization using Automatically Extracted Keyphrases. In Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS-35).

10 [18] Kuo, B., Hentrich, T., Good, B. and Wilkinson, M. (2007). Tag Clouds for Summarizing Web Search Results, in Proc. of the 16th International Conference on the World Wide Web, Banff, Alberta, Canada, < [19] Labrou, Y. and Finin, T. (1999). Yahoo! as an ontology: using Yahoo! categories to describe documents. In Proceedings of the Eighth international Conference on information and Knowledge Management, Kansas City, Missouri, United States, < [20] Lamantia, J. (2006). Tag Clouds: Navigation for Landscapes of Meaning. Joe Lamantia Blog. 16 May < uds_navigation_for_landscapes_of_meaning.html>. [23] Mathes, A. (2004). Folksonomies - Cooperative Classification and Communication through Shared Metadata. Unpublished paper. 16 May < < poster.pdf>. [28] Wikipedia (2007). N-Gram definition. < [29] Wikipedia (2007). TF-IDF definition. < [30] Wu, H., Zubair, M. and Maly, K. (2007). Collaborative Classification of Growing Collections with Evolving Facets, in Proc. of the 18th conference on Hypertext and hypermedia, Manchester, UK, < >. [31] Yahoo! Directory. 14 Apr < [21] Lamantia, J. (2006). Text Clouds: A new form of Tag Cloud? Joe Lamantia Blog. 16 May < xt_clouds_a_new_form_of_tag_cloud.html>. [22] Liu, H., Selker, T. and Lieberman, H. (2003). Visualizing the Affective Structure of a Text Document. In Proc. of the Conference on Human Factors in Computing Systems (CHI 03), Ft. Lauderdale, FL, USA, < g-affective.pdf>. [24] Mladenic, D. (1998). Turning Yahoo into an Automatic Web-Page Classifier, in Proc. of the 13th European Conference on Aritficial Intelligence ECAI'98 (pp ). < [25] Steinbock, D. (2006). TagCrowd: Create your own tag cloud from any text. 22 Nov < [26] Watters, D. (2008). CloudMine: Demonstrating a novel interface for text visualization. Unpublished demonstration. 09 June < ne_demo.htm>. [27] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., and Nevill-Manning, C. G. (1999). KEA: practical automatic keyphrase extraction. 
In Proceedings of the Fourth ACM Conference on Digital Libraries (Berkeley, California, United States, August 11-14, 1999). 10


More information

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt>

Intelligent Log Analyzer. André Restivo <andre.restivo@portugalmail.pt> Intelligent Log Analyzer André Restivo 9th January 2003 Abstract Server Administrators often have to analyze server logs to find if something is wrong with their machines.

More information

Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

More information

Structured Content: the Key to Agile. Web Experience Management. Introduction

Structured Content: the Key to Agile. Web Experience Management. Introduction Structured Content: the Key to Agile CONTENTS Introduction....................... 1 Structured Content Defined...2 Structured Content is Intelligent...2 Structured Content and Customer Experience...3 Structured

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Automated News Item Categorization

Automated News Item Categorization Automated News Item Categorization Hrvoje Bacan, Igor S. Pandzic* Department of Telecommunications, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia {Hrvoje.Bacan,Igor.Pandzic}@fer.hr

More information

Survey Results: Requirements and Use Cases for Linguistic Linked Data

Survey Results: Requirements and Use Cases for Linguistic Linked Data Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group

More information

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach

ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach ecommerce Web-Site Trust Assessment Framework Based on Web Mining Approach Banatus Soiraya Faculty of Technology King Mongkut's

More information

Chapter Managing Knowledge in the Digital Firm

Chapter Managing Knowledge in the Digital Firm Chapter Managing Knowledge in the Digital Firm Essay Questions: 1. What is knowledge management? Briefly outline the knowledge management chain. 2. Identify the three major types of knowledge management

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

Inner Classification of Clusters for Online News

Inner Classification of Clusters for Online News Inner Classification of Clusters for Online News Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

More information

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on

More information