Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre
(AxIS research team, Inria, France)

Abstract: This paper presents experiments in clustering homogeneous XML documents to validate an existing classification or, more generally, an organisational structure. Our approach integrates techniques for extracting knowledge from documents with unsupervised classification (clustering) of documents. We focus on the feature selection used for representing documents and its impact on the emerging classification. We combine the selection of structured features with fine textual selection based on syntactic characteristics. We illustrate and evaluate this approach with a collection of Inria activity reports for a single year. The objective is to cluster projects into larger groups (Themes), based on the keywords or on different chapters of these activity reports. We then compare the results of clustering using different feature selections with the official theme structure used by Inria.

Key Words: XML clustering, categorisation, organisational structure, knowledge discovery
Category: H.3.1, I.5.3, I

1 Introduction

With the increasing amount of available information, sophisticated tools are needed to support users in finding useful information. In addition to tools for retrieving relevant documents, there is a need for tools that synthesise and exhibit information that is not explicitly contained in the document collection, using document mining techniques. Document mining objectives include extracting structured information from raw text, as well as document classification and clustering. Classification aims at associating documents with one or several predefined categories, while the objective of clustering is to identify emerging classes that are not known in advance.
Traditional approaches to document classification and clustering rely on various statistical models and on representations of documents mostly based on bags of words. An important characteristic of text categorisation is the size of the vocabulary, often referred to as the high dimension of the feature space. Automatic feature selection methods have been proposed to reduce the dimension of the space. They usually try to identify representative words that can discriminate documents between various classes. For a comparison of those methods for classification see [Yang and Pederson 1997].

XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Standard methods have been used to classify XML documents by reducing them to their textual parts. These approaches do not take advantage of the structure of XML documents, which also carries important information.

In this paper we focus on XML documents and study the impact of selecting different parts of the documents for a specific clustering task. The idea is that different parts of XML documents correspond to different dimensions of the collection that may play different roles in the classification task. We therefore consider two levels of feature selection: 1) coarse selection at the structure level and 2) fine linguistic selection of words within the text of elements. Based on the selected features, the documents are then clustered using a dynamic classification algorithm that builds a prototype of each cluster as the union of all the features (words) of the documents belonging to this cluster. Furthermore, for each resulting cluster, we can exhibit the words that discriminate this cluster.

Our experiments use the collection of activity reports produced by the research groups at Inria. The task is to identify meaningful themes that would group projects working in related research domains. These groups are then compared to two expert groupings: the first one used by Inria until 2003, and the new one proposed in 2004.

2 Inria Activity Report

Every year, Inria (the French National Institute for Research in Computer Science and Control) publishes an activity report (RA) made available to the French parliament and to our industrial and research partners.
Traditionally produced as a paper document, this report is now published as a CD-ROM, and the scientific part has been available on the Web since 1996 (in HTML and PDF). It is a collection of reports written by every Inria research team (in English since 2002). The XML version of these documents contains 139 files and more than 14.8 Mbytes of data. Although the logical structure is defined by a DTD, the overall style and content are very flexible and unconstrained. The top-level part of the DTD is given below:

    <!ELEMENT raweb (header, moreinfo?, members, presentation, foundation?, domain?, software?, results, contracts?, international?, dissemination?, biblio)>
    <!ATTLIST raweb year CDATA #IMPLIED>

Mandatory sections include the list of team members, the presentation of objectives, new results, and the list of publications for the year (biblio). Optional sections include research foundation, application domains, software, as well as international and national cooperations.

Inria research teams are also grouped into scientific themes that act as virtual structures for the purpose of presentation, communication and evaluation. The number of themes and the allocation of teams to themes were decided some years ago by the board of directors and have changed recently. The choice of themes and the team allocation are mostly related to strategic objectives and to scientific closeness between existing teams. This has motivated our experiments in comparing the automatic clustering of teams, based on the self-descriptions in their activity reports, with the two sets of themes defined by the organisation. We will call them Expert Themes 2003 and Expert Themes 2004 respectively. Without anticipating the results of the experiments, we are interested in discovering possible natural groupings of teams, identifying the keywords that best characterise those groups, and the potential differences with the organisational structures.

3 Methodology for Clustering XML Documents

As said above, our objective is to cluster the research teams into themes, using their activity reports as the data source. We hypothesise that activity reports reflect the research domains the teams are involved in and that some parts of the reports are more representative than others in describing research. For example, the conferences and journals where researchers publish are indicative of their research fields.

3.1 Structure Selection and Vocabulary Definition

The first step consists in selecting various parts of the XML documents that may be relevant for the classification task. This extraction uses the tools described in [Despeyroux 2004] to extract the text of elements, but standard XML tools could be used instead when the extraction does not require any inference.
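As an illustration of the remark that standard XML tools suffice when no inference is needed, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the tool of [Despeyroux 2004]. The sample document, the inner `<p>` elements and the helper name `extract_text` are assumptions made for the example; only the top-level element names come from the DTD quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature report following the top-level DTD quoted above.
# The inner <p> elements and the placeholder year are illustrative assumptions.
SAMPLE = """<raweb year="YYYY">
  <header/>
  <members/>
  <presentation><p>We study clustering of XML documents.</p></presentation>
  <foundation><p>Statistical models.</p><p>Data mining.</p></foundation>
  <results/>
  <biblio/>
</raweb>"""


def extract_text(root, element_names):
    """Return the whitespace-normalised text of the selected child elements."""
    parts = []
    for name in element_names:
        for elem in root.findall(name):
            # itertext() yields every nested text node and skips the tags.
            text = " ".join(" ".join(elem.itertext()).split())
            if text:
                parts.append(text)
    return " ".join(parts)


root = ET.fromstring(SAMPLE)
presentation_only = extract_text(root, ["presentation"])
presentation_and_foundation = extract_text(root, ["presentation", "foundation"])
print(presentation_only)
print(presentation_and_foundation)
```

Selecting a different list of element names is all that changes from one structured feature selection to another, which is why the structural choice can be kept separate from the later word-level filtering.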
As we expect that various parts of the activity report play different roles in classifying teams, we ran five experiments using different parts of the activity report, each corresponding to well-identified XML elements. We call this process structured feature selection. In this step, the documents are represented by the text of the selected elements.

1. Experiment K-F: Keywords attached to the foundation part
2. Experiment K-all: Keywords, whatever the sections they are attached to
3. Experiment T-P: Full text of the presentation part
4. Experiment T-PF: Full text of the presentation and foundation parts
5. Experiment T-C: Names of conferences, workshops, congresses, etc.

Table 1: Size of data for the various experiments. Columns: experiment, number of projects, extracted words, selected words, vocabulary. Rows: K-F, K-all, T-P, T-PF, T-C (cell values not recoverable).

The goal of these experiments is to evaluate which parts are most relevant for the clustering task.

The second processing step consists in the automatic selection of significant words within the previously selected texts. Classical methods of textual feature selection are based on statistical approaches, for example selection based on document frequency (DF) or information gain (IG). These methods work well for large collections of texts and involve pre-processing of the full collection. In our case the frequency of words may vary depending on the selected parts of documents, and the size of the resulting collection can be very heterogeneous from one experiment to the other. In order to avoid heterogeneous frequencies, we chose a natural language approach where words are tagged and selected according to their syntactic role in the sentence. We use TreeTagger, a tool for annotating text with part-of-speech and lemma information, developed at the Institute for Computational Linguistics of the University of Stuttgart [Schmid 1994, Schmid 1995].

We retain different types of words, depending on the structured feature selection. For experiments K-F and K-all (keywords) we keep nouns, verbs and adjectives (excluding conjunctions, unknown words, etc.), while for experiments T-PF and T-P (full text) we keep only the nouns, to limit the number of features.

There is one difficulty with conference names, due to their very free and heterogeneous labelling: some teams use the full name of the conference, others the acronym in various formats (e.g. POPL 03, POPL03, POPL 2003). We therefore manually built a normalized list of all the conference names and automatically matched the form used in the RA with the normalized form.
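The matching of acronym variants to a normalized form can be sketched as follows. This is only an assumed illustration: the paper does not describe its matching procedure, the three-entry acronym set stands in for the manually built list, and the choice of the bare upper-case acronym as the normalized form is ours.

```python
import re

# Hypothetical stand-in for the manually built list of normalized names.
KNOWN_ACRONYMS = {"POPL", "SIGIR", "KDD"}

# An acronym, optionally followed by an apostrophe and a 2- or 4-digit year.
_LABEL = re.compile(r"\s*([A-Za-z]+)\s*'?(?:\d{4}|\d{2})?\s*")


def normalize_conference(label):
    """Map variants such as 'POPL 03', 'POPL03' or 'POPL 2003' to 'POPL'.

    Returns None when the label does not match a known acronym.
    """
    match = _LABEL.fullmatch(label)
    if match is None:
        return None
    acronym = match.group(1).upper()
    return acronym if acronym in KNOWN_ACRONYMS else None


for variant in ["POPL 03", "POPL03", "POPL 2003", "popl'03"]:
    print(variant, "->", normalize_conference(variant))
```

Full conference names would need an additional lookup table from long titles to acronyms; that part is left out here because it depends entirely on the hand-built list.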
Since conference acronyms are significant and unknown to the tagger, we decided not to use the tagger for this experiment, keeping all the words except stop words (such as proceedings, conference, etc.). Finally, for all experiments, words used by fewer than two teams are removed. Table 1 summarises the size of data (words) used in each experiment.

3.2 Clustering Method and External Evaluation

The third step is the clustering of documents into a set of disjoint classes, using the vocabulary defined for the five experiments as described above. Our clustering algorithm is based on the partitioning method proposed by [Celeux et al 1989], where the distance between clusters is based on the frequency of the words of the selected vocabulary. This approach is equivalent to the k-means algorithm. As for k-means, we represent the clusters by prototypes which summarize the whole information of the documents belonging to each of them. More precisely, if the vocabulary contains p words, each document s is represented by the vector $x_s = (x_s^1, \ldots, x_s^j, \ldots, x_s^p)$, where $x_s^j$ is the number of occurrences of word j in document s. The prototype of a class $U_i$ is then $g_i = (g_i^1, \ldots, g_i^j, \ldots, g_i^p)$ with

$$g_i^j = \sum_{s \in U_i} x_s^j.$$

Finally, the prototype of each class having been fixed, every element is assigned to a class according to its proximity to the prototype. The proximity is measured by a classical distance between distributions (e.g. chi-squared).

We evaluate the quality of our automatic clustering by comparing the results with the two sets of themes used by Inria. We call this evaluation external validity, since the clustering process does not involve those themes. For all quantitative evaluations we use two complementary measures: the F-measure and the corrected Rand index.

The F-measure proposed by [Jardine and Rijsbergen 1963] combines the precision and recall measures from information retrieval and treats each cluster as if it were the result of a query and each class as if it were the desired answer to that query. For an a priori group $U_i$ and a cluster $C_j$, $recall(i,j) = n_{ij}/n_{i.}$ and $precision(i,j) = n_{ij}/n_{.j}$, where $n_{ij}$ is the number of documents that are both in group $U_i$ and in cluster $C_j$, $n_{i.}$ is the number of documents in the a priori group $U_i$, and $n_{.j}$ is the number of documents in cluster $C_j$. Then the F-measure between $U_i$ and $C_j$ is given by

$$F(i,j) = \frac{2 \cdot recall(i,j) \cdot precision(i,j)}{recall(i,j) + precision(i,j)}.$$

The F-measure between the a priori partition U and the partition C in k clusters is given by

$$F = \sum_{j=1}^{k} \frac{n_{.j}}{n} \max_{i} F(i,j) \qquad (1)$$

where n is the total number of documents in the data set.
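The prototype-based partitioning described earlier in this section can be sketched as below. This is a simplified illustration, not the algorithm of [Celeux et al 1989]: the exact form of the chi-squared distance (profiles weighted by global word frequencies), the handling of empty classes, the toy count vectors and the function names are all assumptions made for the example. It also assumes every document and every word has a non-zero total count, which the "at least two teams" filter suggests for words.

```python
def profile(vector):
    """Turn a count vector into a relative-frequency distribution."""
    total = sum(vector)
    return [value / total for value in vector]


def chi2_distance(doc, prototype, word_weights):
    """Chi-squared distance between two profiles (assumed variant)."""
    p_doc, p_proto = profile(doc), profile(prototype)
    return sum((a - b) ** 2 / w for a, b, w in zip(p_doc, p_proto, word_weights))


def cluster(docs, initial_assignment, max_iter=100):
    """Alternate prototype computation and reassignment until stable."""
    k = max(initial_assignment) + 1
    p = len(docs[0])
    grand_total = sum(sum(doc) for doc in docs)
    # Global relative frequency of each word, used as chi-squared weights.
    word_weights = [sum(doc[j] for doc in docs) / grand_total for j in range(p)]
    assignment = list(initial_assignment)
    for _ in range(max_iter):
        # Representation step: g_i^j is the sum of x_s^j over documents in class i.
        prototypes = [[0] * p for _ in range(k)]
        for doc, label in zip(docs, assignment):
            for j in range(p):
                prototypes[label][j] += doc[j]
        # Empty classes have no profile and are skipped (assumption).
        candidates = [i for i in range(k) if sum(prototypes[i]) > 0]
        # Allocation step: each document joins the class of its nearest prototype.
        new_assignment = [
            min(candidates, key=lambda i: chi2_distance(doc, prototypes[i], word_weights))
            for doc in docs
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


# Toy word counts for four documents over a four-word vocabulary (made up).
DOCS = [[5, 1, 0, 0], [4, 2, 0, 0], [0, 0, 3, 3], [0, 1, 2, 4]]
labels = cluster(DOCS, [0, 0, 1, 0])
print(labels)
```

Like k-means, this procedure only reaches a local optimum that depends on the initial assignment, so in practice several starting partitions would be tried.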
The corrected Rand (CR) index is defined in [Hubert and Arabie 1985] for comparing two partitions. We recall its definition. Let $U = \{U_1, \ldots, U_i, \ldots, U_r\}$ and $P = \{C_1, \ldots, C_j, \ldots, C_k\}$ be two partitions of the same data set having respectively r and k clusters. The corrected Rand index is:

$$CR = \frac{\sum_{i=1}^{r}\sum_{j=1}^{k}\binom{n_{ij}}{2} - \binom{n}{2}^{-1}\sum_{i=1}^{r}\binom{n_{i.}}{2}\sum_{j=1}^{k}\binom{n_{.j}}{2}}{\frac{1}{2}\left[\sum_{i=1}^{r}\binom{n_{i.}}{2} + \sum_{j=1}^{k}\binom{n_{.j}}{2}\right] - \binom{n}{2}^{-1}\sum_{i=1}^{r}\binom{n_{i.}}{2}\sum_{j=1}^{k}\binom{n_{.j}}{2}} \qquad (2)$$

where $\binom{n}{2} = \frac{n(n-1)}{2}$ and $n_{ij}$, $n_{i.}$, $n_{.j}$ and n are defined as above.
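Both external measures, the F-measure of equation (1) and the corrected Rand index of equation (2), depend only on the contingency counts $n_{ij}$, $n_{i.}$ and $n_{.j}$. The sketch below computes them from two label lists; the function names and the toy labels are assumptions made for illustration, and `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb


def contingency(classes, clusters):
    """Return the counts n_ij, n_i. and n_.j for two labelings of the same items."""
    return Counter(zip(classes, clusters)), Counter(classes), Counter(clusters)


def f_measure(classes, clusters):
    """Size-weighted best F(i, j) per cluster, as in equation (1)."""
    n_ij, n_i, n_j = contingency(classes, clusters)
    n = len(classes)
    total = 0.0
    for j, size_j in n_j.items():
        best = 0.0
        for i, size_i in n_i.items():
            common = n_ij.get((i, j), 0)
            if common == 0:
                continue  # F(i, j) is zero when the two sets do not overlap
            recall = common / size_i
            precision = common / size_j
            best = max(best, 2 * recall * precision / (recall + precision))
        total += size_j / n * best
    return total


def corrected_rand(classes, clusters):
    """Corrected Rand index of Hubert and Arabie, as in equation (2).

    Undefined (division by zero) when the denominator vanishes, for
    instance when both partitions consist of a single class.
    """
    n_ij, n_i, n_j = contingency(classes, clusters)
    n = len(classes)
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_i = sum(comb(v, 2) for v in n_i.values())
    sum_j = sum(comb(v, 2) for v in n_j.values())
    expected = sum_i * sum_j / comb(n, 2)
    maximum = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (maximum - expected)


themes = ["a", "a", "b", "b"]  # hypothetical expert themes
print(f_measure(themes, [0, 0, 1, 1]), corrected_rand(themes, [0, 0, 1, 1]))
print(f_measure(themes, [0, 1, 0, 1]), corrected_rand(themes, [0, 1, 0, 1]))
```

The second call shows why the two measures are complementary: a clustering that cuts across every theme still gets a positive F-measure, while the corrected Rand index drops below zero, i.e. below the agreement expected by chance.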

Table 2: Results by external validity. Columns: experiment, number of clusters, F and Rand for Themes 2003, F and Rand for sub-themes, F and Rand for Themes 2004. Rows: K-F-a, K-F-b, K-F-c, K-all-a, K-all-b, K-all-c, T-P-a, T-P-b, T-P-c, T-PF-a, T-PF-b, T-PF-c, T-C-a, T-C-b, T-C-c (cell values not recoverable).

To conclude, the F-measure is easier to interpret and can support local analysis (through the F(i,j)), while the Rand index gives a measure of the significance of the results for a given number of clusters.

3.3 Results Analysis

Table 2 summarizes our results for different feature selections and different numbers of clusters (4, 5 and 9). We first analyse the results for Themes 2003 and Themes 2004 separately, then compare the two.

For Themes 2003, we get the best results consistently for the two measures and all features when clustering into 4 clusters. One exception is clustering into 9 sub-themes using the text of both presentation and foundation (T-PF-c). The overall best result is obtained with four clusters using presentation and foundation (T-PF-a). A finer analysis using sub-themes (not presented here for lack of space) highlights a good mapping between clusters and sub-themes.

For Themes 2004, we get good results for clustering into 5 or 4 clusters, with the best results obtained with 5 clusters when using all the keywords. In both cases, the sections about foundation seem representative of the research domains, either through the full text of those sections or through their attached keywords, for the teams who provided such keywords.

Overall, our automatic clustering compares better with Themes 2003 than with Themes 2004, with the exception of using all keywords for creating 5 clusters. There is not much difference between Themes 2003 and Themes 2004 when using the conference names. The somewhat disappointing results with conference names may be explained by not using the tagger, which leaves us with too many different words (see Table 1).
We also note that the two measures, F-measure and corrected Rand index, are consistent throughout the experiments: high F-measure scores correspond to good Rand values.

4 Related Work

Research on classification and clustering methods for XML or semi-structured documents is currently very active. New document models have been proposed by [Yi and Dundaresan 2000] and [Denoyer et al 2003] to extend the classical vector model and take into account both the structure and the textual part. This amounts to distinguishing words appearing in different types of XML elements in a generic way, while our approach uses the structure to select (manually) the type of elements relevant to a specific mining objective.

XML document clustering has been used mostly for visualizing large collections of documents; for example, [Guillaume and Murtagh 2000] cluster AML (Astronomical Markup Language) documents based only on their links. [Jianwu and Xiaoou 2002] propose a model similar to [Yi and Dundaresan 2000] but add in- and out-links to the model, and they use it for clustering rather than classification. [Yoon and Raghavan 2001] also propose a BitCube model for clustering that represents documents based on their epaths (paths of text elements) and textual content. Their focus is on evaluating time performance rather than clustering effectiveness.

Another direction is clustering Web documents returned as answers to a query, as an alternative to ranked lists. [Zamir and Etziono 1998] propose an original algorithm using a suffix tree structure that is linear in the size of the collection and incremental, an important feature to support online clustering. [Larsen and Aone 1999] compare different text feature extractions and variants of a linear-time clustering algorithm using random seed selection with center adjustment.

5 Conclusion and Future Work

In this paper we have presented a methodology for clustering XML documents and evaluated the results, for different feature selections, in comparison with two existing typologies.
Although the analysis is closely related to our specific collection, we believe that the approach can be used in other contexts, for other XML collections where some knowledge of the semantics of the DTD is available. The results show that the quality of clustering strongly depends on the selected document features. In our application, clustering using foundation sections always outperforms clustering using keywords. This conclusion can be turned around, as an indication for the organization that some parts of the activity report do not appropriately describe the research domains, and that the choice of keywords and the research presentation could be improved to carry a stronger message.

On more technical aspects, our approach provides a flexible clustering framework where structured features and textual features can be selected independently, although comparisons should be made with textual feature selection based on a statistical approach (tf*idf). Finally, we plan to carry out further experiments with conference names using an ontology of conferences. A first step would be to build a good classifier able to match incomplete and incorrect conference or journal titles with their normalized forms.

Acknowledgements: The authors wish to thank Mihai Jurca, engineer in the AxIS team, for his useful support in the feature pre-processing of the data.

References

[Celeux et al 1989] Celeux G., Diday E., Govaert G., Lechevallier Y., Ralambondrainy H.: Classification Automatique des Données, Environnement statistique et informatique. Bordas, Paris (1989).
[Denoyer et al 2003] Denoyer L., Vittaut J.-N., Gallinari P., Brunessaux S.: Structured Multimedia Document Classification, in DocEng 03, November 20-22, 2003, Grenoble, France, article/783.pdf.
[Despeyroux 2004] Despeyroux Th.: Practical Semantic Analysis of Web Sites and Documents, in Proceedings of the 13th World Wide Web Conference (WWW2004), New York City, May (2004).
[Guillaume and Murtagh 2000] Guillaume D., Murtagh F.: Clustering of XML documents, in Computer Physics Communications 127 (2000).
[Hubert and Arabie 1985] Hubert L., Arabie P.: Comparing Partitions, Journal of Classification, Vol. 2 (1985).
[Jardine and Rijsbergen 1963] Jardine N., van Rijsbergen C.J.: The use of hierarchic clustering in information retrieval, Information Storage and Retrieval, 7.
[Jianwu and Xiaoou 2002] Jianwu Y., Xiaoou C.: A semi-structured document model for text mining, Journal of Computer Science and Technology, Volume 17(5), May (2002).
[Larsen and Aone 1999] Larsen B., Aone C.: Fast and effective text mining using linear-time document clustering, in Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1999).
[Schmid 1994] Schmid H.: Probabilistic Part-of-Speech Tagging Using Decision Trees, revised version, original work in Proc. of the International Conference on New Methods in Language Processing, Manchester, UK (1994).
[Schmid 1995] Schmid H.: Improvements in Part-of-Speech Tagging With an Application To German, revised version, original work in Proc. of the EACL SIGDAT workshop, Dublin (1995).
[Yang and Pederson 1997] Yang Y., Pedersen J.O.: A comparative study on Feature Selection in text categorisation, in Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann (1997).
[Yi and Dundaresan 2000] Yi J., Sundaresan N.: A classifier for semi-structured documents, in Proc. of the 6th International Conference on Knowledge Discovery and Data Mining (KDD) (2000).
[Yoon and Raghavan 2001] Yoon J., Raghavan V.: BitCube: Clustering and Statistical Analysis for XML Documents, Journal of Intelligent Information Systems (2001).
[Zamir and Etziono 1998] Zamir O., Etzioni O.: Web Document clustering: A feasibility demonstration, in ACM Conf SIGIR98, Melbourne, Australia (1998).


More information

Facilitating Business Process Discovery using Email Analysis

Facilitating Business Process Discovery using Email Analysis Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process

More information

Inner Classification of Clusters for Online News

Inner Classification of Clusters for Online News Inner Classification of Clusters for Online News Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

More information

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D. Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital

More information

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS

More information

Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

More information

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data

Cloud Storage-based Intelligent Document Archiving for the Management of Big Data Cloud Storage-based Intelligent Document Archiving for the Management of Big Data Keedong Yoo Dept. of Management Information Systems Dankook University Cheonan, Republic of Korea Abstract : The cloud

More information

INTELLIGENT VIDEO SYNTHESIS USING VIRTUAL VIDEO PRESCRIPTIONS

INTELLIGENT VIDEO SYNTHESIS USING VIRTUAL VIDEO PRESCRIPTIONS INTELLIGENT VIDEO SYNTHESIS USING VIRTUAL VIDEO PRESCRIPTIONS C. A. LINDLEY CSIRO Mathematical and Information Sciences E6B, Macquarie University Campus, North Ryde, NSW, Australia 2113 E-mail: craig.lindley@cmis.csiro.au

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Financial Trading System using Combination of Textual and Numerical Data

Financial Trading System using Combination of Textual and Numerical Data Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,

More information

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT) The Development of Multimedia-Multilingual Storage, Retrieval and Delivery for E-Organization (STREDEO PROJECT) Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit

More information

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY QÜESTIIÓ, vol. 25, 3, p. 509-520, 2001 PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY GEORGES HÉBRAIL We present in this paper the main applications of data mining techniques at Electricité de France,

More information

Information Retrieval Systems in XML Based Database A review

Information Retrieval Systems in XML Based Database A review Information Retrieval Systems in XML Based Database A review Preeti Pandey 1, L.S.Maurya 2 Research Scholar, IT Department, SRMSCET, Bareilly, India 1 Associate Professor, IT Department, SRMSCET, Bareilly,

More information

OLAP Visualization Operator for Complex Data

OLAP Visualization Operator for Complex Data OLAP Visualization Operator for Complex Data Sabine Loudcher and Omar Boussaid ERIC laboratory, University of Lyon (University Lyon 2) 5 avenue Pierre Mendes-France, 69676 Bron Cedex, France Tel.: +33-4-78772320,

More information

Intinno: A Web Integrated Digital Library and Learning Content Management System

Intinno: A Web Integrated Digital Library and Learning Content Management System Intinno: A Web Integrated Digital Library and Learning Content Management System Synopsis of the Thesis to be submitted in Partial Fulfillment of the Requirements for the Award of the Degree of Master

More information

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 5 (Nov. - Dec. 2012), PP 36-41 Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

More information

Semantic annotation of requirements for automatic UML class diagram generation

Semantic annotation of requirements for automatic UML class diagram generation www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute

More information

Role of Text Mining in Business Intelligence

Role of Text Mining in Business Intelligence Role of Text Mining in Business Intelligence Palak Gupta 1, Barkha Narang 2 Abstract This paper includes the combined study of business intelligence and text mining of uncertain data. The data that is

More information

Data mining in the e-learning domain

Data mining in the e-learning domain Data mining in the e-learning domain The author is Education Liaison Officer for e-learning, Knowsley Council and University of Liverpool, Wigan, UK. Keywords Higher education, Classification, Data encapsulation,

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction

Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Improving Data Driven Part-of-Speech Tagging by Morphologic Knowledge Induction Uwe D. Reichel Department of Phonetics and Speech Communication University of Munich reichelu@phonetik.uni-muenchen.de Abstract

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

Data Mining Governance for Service Oriented Architecture

Data Mining Governance for Service Oriented Architecture Data Mining Governance for Service Oriented Architecture Ali Beklen Software Group IBM Turkey Istanbul, TURKEY alibek@tr.ibm.com Turgay Tugay Bilgin Dept. of Computer Engineering Maltepe University Istanbul,

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Self Organizing Maps for Visualization of Categories

Self Organizing Maps for Visualization of Categories Self Organizing Maps for Visualization of Categories Julian Szymański 1 and Włodzisław Duch 2,3 1 Department of Computer Systems Architecture, Gdańsk University of Technology, Poland, julian.szymanski@eti.pg.gda.pl

More information

A Study of Web Log Analysis Using Clustering Techniques

A Study of Web Log Analysis Using Clustering Techniques A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Automatic Annotation Wrapper Generation and Mining Web Database Search Result Automatic Annotation Wrapper Generation and Mining Web Database Search Result V.Yogam 1, K.Umamaheswari 2 1 PG student, ME Software Engineering, Anna University (BIT campus), Trichy, Tamil nadu, India

More information

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work Unsupervised Paraphrase Acquisition via Relation Discovery Takaaki Hasegawa Cyberspace Laboratories Nippon Telegraph and Telephone Corporation 1-1 Hikarinooka, Yokosuka, Kanagawa 239-0847, Japan hasegawa.takaaki@lab.ntt.co.jp

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for

More information

Topics in basic DBMS course

Topics in basic DBMS course Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch

More information

Automatic Indexing of Scanned Documents - a Layout-based Approach

Automatic Indexing of Scanned Documents - a Layout-based Approach Automatic Indexing of Scanned Documents - a Layout-based Approach Daniel Esser a,danielschuster a, Klemens Muthmann a, Michael Berger b, Alexander Schill a a TU Dresden, Computer Networks Group, 01062

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

COURSE RECOMMENDER SYSTEM IN E-LEARNING

COURSE RECOMMENDER SYSTEM IN E-LEARNING International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand

More information

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization Ángela Blanco Universidad Pontificia de Salamanca ablancogo@upsa.es Spain Manuel Martín-Merino Universidad

More information

isecure: Integrating Learning Resources for Information Security Research and Education The isecure team

isecure: Integrating Learning Resources for Information Security Research and Education The isecure team isecure: Integrating Learning Resources for Information Security Research and Education The isecure team 1 isecure NSF-funded collaborative project (2012-2015) Faculty NJIT Vincent Oria Jim Geller Reza

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support Rok Rupnik, Matjaž Kukar, Marko Bajec, Marjan Krisper University of Ljubljana, Faculty of Computer and Information

More information

QASM: a Q&A Social Media System Based on Social Semantics

QASM: a Q&A Social Media System Based on Social Semantics QASM: a Q&A Social Media System Based on Social Semantics Zide Meng, Fabien Gandon, Catherine Faron-Zucker To cite this version: Zide Meng, Fabien Gandon, Catherine Faron-Zucker. QASM: a Q&A Social Media

More information

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition

More information

Mining the Web of Linked Data with RapidMiner

Mining the Web of Linked Data with RapidMiner Mining the Web of Linked Data with RapidMiner Petar Ristoski, Christian Bizer, and Heiko Paulheim University of Mannheim, Germany Data and Web Science Group {petar.ristoski,heiko,chris}@informatik.uni-mannheim.de

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Optimization of Internet Search based on Noun Phrases and Clustering Techniques

Optimization of Internet Search based on Noun Phrases and Clustering Techniques Optimization of Internet Search based on Noun Phrases and Clustering Techniques R. Subhashini Research Scholar, Sathyabama University, Chennai-119, India V. Jawahar Senthil Kumar Assistant Professor, Anna

More information

Journal of Global Research in Computer Science RESEARCH SUPPORT SYSTEMS AS AN EFFECTIVE WEB BASED INFORMATION SYSTEM

Journal of Global Research in Computer Science RESEARCH SUPPORT SYSTEMS AS AN EFFECTIVE WEB BASED INFORMATION SYSTEM Volume 2, No. 5, May 2011 Journal of Global Research in Computer Science REVIEW ARTICLE Available Online at www.jgrcs.info RESEARCH SUPPORT SYSTEMS AS AN EFFECTIVE WEB BASED INFORMATION SYSTEM Sheilini

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk

Text Mining for Health Care and Medicine. Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk Text Mining for Health Care and Medicine Sophia Ananiadou Director National Centre for Text Mining www.nactem.ac.uk The Need for Text Mining MEDLINE 2005: ~14M 2009: ~18M Overwhelming information in textual,

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,

More information

ISSUES ON FORMING METADATA OF EDITORIAL SYSTEM S DOCUMENT MANAGEMENT

ISSUES ON FORMING METADATA OF EDITORIAL SYSTEM S DOCUMENT MANAGEMENT ISSN 1392 124X INFORMATION TECHNOLOGY AND CONTROL, 2005, Vol.34, No.4 ISSUES ON FORMING METADATA OF EDITORIAL SYSTEM S DOCUMENT MANAGEMENT Marijus Bernotas, Remigijus Laurutis, Asta Slotkienė Information

More information

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction

Sentiment Analysis of Movie Reviews and Twitter Statuses. Introduction Sentiment Analysis of Movie Reviews and Twitter Statuses Introduction Sentiment analysis is the task of identifying whether the opinion expressed in a text is positive or negative in general, or about

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

Volume 2, Issue 12, December 2014 International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 12, December 2014 International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 12, December 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com

More information