Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining
S. Iiritano, Getronics S.p.A., Rende (CS), Italy, s_iiritano@yahoo.it
M. Ruffolo, Intersiel S.p.A., Via G. Rossini, Rende (CS), Italy, m.ruffolo@telcal.it

Abstract

The huge amount of unstructured data available on the Web and on intranets creates an information overload problem, so managing the knowledge contained in textual documents is an important problem in Knowledge Management. Knowledge can be extracted from collections of data through Knowledge Discovery in Databases (KDD), an interactive and iterative process that explores data to discover new and interesting patterns within them. The fundamental phase of the KDD process is Data Mining when the data are structured and Text Mining when they are unstructured. This paper describes a prototype of a vertical corporate portal that implements a KDD process for knowledge extraction from the unstructured data contained in textual documents. Text mining is realized through a clustering method that partitions a set of documents on the basis of their contents, characterized by word frequencies.

1. Introduction

Using Knowledge Discovery in Databases (KDD), whose fundamental step is Data Mining, knowledge workers can obtain important strategic information for their business. KDD has deeply transformed the way traditional databases, where data are in structured form, are interrogated, by automatically finding new and unknown patterns in huge quantities of data. However, structured data represent only a small part of an organization's overall knowledge; most of this knowledge is embedded in textual documents. The amount of unstructured information in this form, accessible through the web, intranets, newsgroups, etc., has increased enormously in recent years.
In this scenario, developing techniques and tools for Knowledge Extraction that can manage the knowledge contained in electronic textual documents is a necessary task. This is possible through a KDD process based on Text Mining. One particular Text Mining approach uses clustering techniques to group documents according to their content; in this case the knowledge extraction process consists of recognizing these groups.

In this paper we present, in Section 2, a prototype of a vertical corporate portal that implements a KDD process as described above, in which documents are grouped on the basis of word frequencies. Section 4 contains the results of two experiments carried out on a test corpus composed of articles extracted from some American newspapers and web publications, evaluated using the measures defined in Section 3.

2. A Vertical Corporate Portal for Clustering Textual Documents

Clustering methods are techniques for partitioning a set of objects into non-overlapping groups (clusters) on the basis of suitable similarity measures. Numerous clustering algorithms [6] and a wide variety of similarity coefficients [7] can be found in the literature. These techniques can be used in a KDD process to extract the knowledge contained in (textual) documents, as shown in Fig. 1.

This work is supported by Laboratorio per l'Innovazione dell'Azione Progettuale Ricerca del Piano Telematico Calabria.
Figure 1. The KDD process for unstructured data. (Diagram labels: Document Sources, Document Acquisition, Repository, Document Pre-Processing, Structured Documents, Document Corpus, Text Mining, Clustering, Knowledge Extraction Results, Result Interpretation and Refinement.)

The KDD process is composed of four phases:
- Document Acquisition: a collection of documents coming from various sources (Internet, company intranet, etc.) is stored in a repository;
- Document Pre-processing: documents undergo a linguistic pre-processing based on term filtering and context analysis, then an internal representation based on word frequencies is produced;
- Text Mining: documents are partitioned into clusters;
- Results Interpretation and Refinement: clusters are submitted to a human operator for interpretation and refinement.

To implement this KDD process we developed a prototype of a Vertical Corporate Portal (VCP), composed of six modules that interact as shown in Fig. 2.

Figure 2. Architecture of the VCP. (Diagram labels: Intranet, Web, Document Collector, Crawler, Repository Manager, Repository, Corpus Selector, Corpus, Text Classifier, Corpus Partition, Partition Analyzer; phases: Acquisition, Pre-processing, Text Mining, Results Interpretation and Refinement.)

The Crawler and the Document Collector carry out the document acquisition phase; the Repository Manager (RM) handles the document pre-processing phase; the Corpus Selector (CS) and the Text Classifier (TC) realize the text mining phase; the Partition Analyser (PA) makes it possible to interpret and refine the results of the document clustering.

2.1 Document Acquisition

VCP is designed to acquire documents from inside or outside the organization. Customer comments and communications, newsgroups, manuals and program documentation, trade publications, internal search reports, and know-how documents that reside on the intranet can be submitted to the repository through the Document Collector, which points to an intranet site or directory to recognize them. Web sites containing information on competitors, markets, products, technologies, etc. can be acquired with the Crawler, an automatic agent that periodically explores selected web sites.

2.2 Document Pre-Processing

The aim of this phase is to produce, for each input document, an internal representation suitable for the text mining phase. The input is a set of m documents, and the output is a set of structured documents, one for each input document.

A document is a sequence of n words (n is the length of the document), and its vocabulary is the set of words $V = \{w_1, \dots, w_\lambda\}$ occurring in it. The structured version of the document, denoted $D$, is a set of pairs $(w_j, f_D(w_j))$ where, for each $j = 1, \dots, \lambda$, $f_D(w_j)$ is the relative frequency of the word $w_j \in V$ in the document (i.e. the number of times $w_j$ occurs in the document, normalised with respect to the document length).

In the VCP architecture the pre-processing phase is carried out by RM in three steps, as shown in Fig. 3.

Figure 3. Document pre-processing phase. (Diagram labels: Document, Filtering, Context Analysis, Structuring, Structured Document.)

In particular:
- Filtering: discards from each document the additional words (articles, subjects, pronouns, prepositions) that are not interesting for the analysis; the remaining words are then reduced to their stems by removing suffixes, prefixes, verb conjugations and plurals. The result of this step is a Filtered document;
- Normalization: given a filtered document, a context analysis is performed through which a synonym is assigned to each word. The result of this step is a Normalized document. Context analysis is performed, only for English documents, using WordNet [18], developed at Princeton University. WordNet is a lexical database that can identify all the meanings (senses) of a given English noun, adjective, verb or adverb, finding its polysemies, antonyms and synonyms. The synonym assigned to a term is the closest sense obtained by considering the words around it (its context). However, to also allow the treatment of documents written in other languages, VCP is implemented so that context analysis can be skipped;
- Structuring: given a normalised document, a Structured document is produced. At the end of this step RM updates two index files: Doc-Index, whose structure is (Cod_Doc, Reference, Number of words), and Word-Index, whose structure is (Word, Cod_Doc, Synonym, Synonym Frequency).

2.3 Text Mining

This phase is carried out in two steps, Corpus Selection and Clustering.

Corpus Selection. We call Corpus a set of structured documents. It can be formed by all the documents in the repository or by a subset of them, selected through a query. This step is performed by the Corpus Selector using information retrieval techniques on the files produced by RM.

Clustering. The input to this phase is a corpus Ω and the output is a partition $P = \{\Gamma_1, \dots, \Gamma_k\}$ of Ω, where each $\Gamma_i$ is called a cluster. A cluster is a set of similar documents and can therefore be viewed as the sequence of h words (h is the cluster length) occurring in all the documents it contains. As for documents, the cluster vocabulary $V_\Gamma = \{w_1, \dots, w_\eta\}$ is the set of words occurring in Γ, and the structured version $C$ is the set of pairs $(w_j, f_C(w_j))$ where, for each $j = 1, \dots, \eta$, $f_C(w_j)$ is the relative frequency of the word $w_j \in V_\Gamma$ in the cluster Γ (i.e. the number of times $w_j$ occurs in Γ, normalised with respect to the cluster length). P is obtained through a clustering technique in which the similarity coefficient is evaluated on the structured representations of clusters and documents, considering the frequencies of the words that belong to both the document vocabulary $V$ and the cluster vocabulary $V_\Gamma$.
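The structured representations of documents and clusters described above can be made concrete with a short Python sketch. It is illustrative only and not the VCP implementation: the stop-word list, the function names (`filter_words`, `structure`, `structure_cluster`) and the choice to measure document length after filtering are assumptions, and stemming and the WordNet-based normalization step are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "it", "is"}


def filter_words(text):
    """Lower-case the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def structure(words):
    """Structured version D: map each word w_j to its relative frequency f_D(w_j)."""
    n = len(words)  # document length, here counted after filtering (assumption)
    if n == 0:
        return {}
    return {w: c / n for w, c in Counter(words).items()}


def structure_cluster(documents):
    """Structured version C: relative frequencies over all words in the cluster."""
    all_words = [w for words in documents for w in words]  # cluster length h
    return structure(all_words)


doc_a = filter_words("The stock market and the price of the stock")
doc_b = filter_words("The price of a trade")
print(structure(doc_a))
print(structure_cluster([doc_a, doc_b]))
```

Keeping documents as word lists and deriving the frequencies on demand means a cluster's structured version can be recomputed whenever a document is added to it.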
In the following we describe in detail the similarity measure and the clustering algorithm implemented in VCP.

2.3.1 Similarity Measure. Let $W = V \cap V_\Gamma = \{w_1, \dots, w_\mu\}$ be the set of words present both in the document vocabulary $V$ and in the cluster vocabulary $V_\Gamma$. The similarity between the document $D$ and the cluster $C$ is measured as:

\[ S = \Phi \, (1 - \Theta) \qquad (1) \]

where

\[ \Phi = \frac{\sum_{k=1}^{\mu} f_D(w_k) + \sum_{k=1}^{\mu} f_C(w_k)}{2} \]

measures the degree of overlap between the document vocabulary and the cluster vocabulary, and

\[ \Theta = \frac{\sum_{k=1}^{\mu} \left| f_D(w_k) - f_C(w_k) \right|}{\mu} \]

measures the dissimilarity between the common part of the document vocabulary and the cluster vocabulary. Note that $S \in [0,1]$. In fact:
- if $V = V_\Gamma$ and $f_D(w_k) = f_C(w_k)$ for every $w_k$ ($k = 1, \dots, \mu$), then $\Phi = 1$, $\Theta = 0$ and $S = 1$;
- if $V \cap V_\Gamma = \emptyset$, then $\Phi = 0$ and $S = 0$.

2.3.2 Clustering Algorithm. The clustering method is illustrated by the algorithm below, written in C-like pseudocode and based on the concepts defined above.
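As a complement to Eq. (1) and to the listing that follows, here is a minimal Python sketch of the similarity measure and of the threshold-based assignment step. It reads Φ as the mean of the two frequency sums over the common words and Θ as the mean absolute frequency difference over those words. Documents are plain word lists, the function names are illustrative, `alpha=0.3` is a placeholder rather than the threshold used in VCP, and the `Re_Control_Clusters` pass is omitted for brevity.

```python
from collections import Counter


def rel_freq(words):
    """Relative word frequencies of a word list (a document or a whole cluster)."""
    n = len(words)
    if n == 0:
        return {}
    return {w: c / n for w, c in Counter(words).items()}


def similarity(f_d, f_c):
    """S = Phi * (1 - Theta), computed over the words common to both vocabularies."""
    common = f_d.keys() & f_c.keys()
    if not common:
        return 0.0  # disjoint vocabularies: Phi = 0, hence S = 0
    phi = (sum(f_d[w] for w in common) + sum(f_c[w] for w in common)) / 2
    theta = sum(abs(f_d[w] - f_c[w]) for w in common) / len(common)
    return phi * (1 - theta)


def assign(documents, alpha):
    """Put each document in its most similar cluster, or open a new one below alpha."""
    clusters = []  # each cluster is a list of documents (word lists)
    for doc in documents:
        f_d = rel_freq(doc)
        sims = [similarity(f_d, rel_freq([w for d in c for w in d])) for c in clusters]
        if not sims or max(sims) < alpha:
            clusters.append([doc])  # no existing cluster is similar enough
        else:
            clusters[sims.index(max(sims))].append(doc)
    return clusters


docs = [
    ["stock", "market", "price", "stock"],
    ["market", "price", "stock", "trade"],
    ["star", "planet", "orbit", "star"],
]
clusters = assign(docs, alpha=0.3)  # alpha is a placeholder value (assumption)
print(len(clusters))
```

Because the cluster frequencies are recomputed from the member documents at every comparison, the sketch stays short at the cost of repeated work; an implementation aimed at large corpora would more likely update the cluster counts incrementally.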
Input: A structured corpus Ω
Output: A partition P of Ω

Initialization:
    extract a document D from Ω;
    create a new cluster C_1 containing document D;
    P = {C_1};

Iteration:
    while (Ω ≠ ∅) do
        extract document D_i from Ω;
        extract cluster C_1 from P;
        maxsimilarity = Calculate_Similarity(D_i, C_1);
        // CL is a temporary cluster list used during work
        CL = P − {C_1};
        while (CL ≠ ∅) do
            extract cluster C_j from cluster list CL;
            maxsimilarity = max{maxsimilarity, Calculate_Similarity(D_i, C_j)};
        if (maxsimilarity < α)
            create a new cluster C that contains document D_i;
            P = P ∪ {C};
            Re_Control_Clusters();
        else
            j = index of the cluster for which maxsimilarity = Calculate_Similarity(D_i, C_j);
            C_j = C_j ∪ {D_i};

The algorithm uses the following functions:
- Re_Control_Clusters(). It is performed only when a new cluster is created. In this case all the documents already assigned to the other clusters are re-checked: for each of them the similarity with the newly created cluster is computed, and if it is greater than the similarity with the cluster to which the document is currently assigned, the document is moved to the new cluster;
- Calculate_Similarity(D_i, C_j). It takes as input a structured document D_i and a structured cluster C_j and returns their similarity.

The threshold value α was determined experimentally.

2.4 Result Interpretation and Refinement

This phase is realized by PA, which shows for each cluster:
- the cluster length;
- the number of documents it contains;
- a list of hyperlinks to these documents;
- a list of the most representative words in the cluster, ordered by frequency.

PA also guides the user in exploring the clusters and the documents they contain more deeply, in order to interpret and refine the text mining results.

3. Measures of Performance: Precision, Coverage, F-Measure

The performance of a clustering method is evaluated by comparing the obtained (real) partition with the ideal one, identified manually by a human operator who splits the documents into clusters on the basis of the homogeneity of their contents. In this Section we present some general formulas for evaluating the performance of any clustering method. We consider measures that refer to single clusters (comparative precision and comparative recall) and measures that refer to the whole partition (total precision, total recall, F-Measure).

3.1 Comparative Precision and Comparative Recall

Let $P = \{\Gamma_1, \dots, \Gamma_\sigma\}$ be an ideal partition and $P' = \{\Gamma'_1, \dots, \Gamma'_\rho\}$ be a real partition of a corpus Ω. Using $|\cdot|$ to denote the cardinality of a set, we can measure the comparative precision of the real cluster $\Gamma'_j$ with respect to the ideal cluster $\Gamma_i$ as:

\[ p_{ij} = \frac{|\Gamma'_j \cap \Gamma_i|}{|\Gamma'_j|} \qquad (2) \]

The comparative recall of the real cluster $\Gamma'_j$ with respect to the ideal cluster $\Gamma_i$ is evaluated as:

\[ r_{ij} = \frac{|\Gamma'_j \cap \Gamma_i|}{|\Gamma_i|} \qquad (3) \]

Comparative precision and comparative recall are used to evaluate total precision and total recall for the whole partition, comparing all real clusters with all ideal ones.

3.2 Total Precision and Total Recall

With total precision and total recall we can evaluate the quality of the obtained partition by analysing the composition of the real clusters closest to the ideal ones.
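Before moving to the totals, the comparative measures of Section 3.1 (Eqs. (2) and (3)) can be sketched with plain set operations. The Python example below assumes that clusters are represented as sets of document identifiers; the function names and the toy partitions are illustrative and not taken from the paper.

```python
def comparative_precision(real, ideal):
    """p_ij: share of the real cluster that also belongs to the ideal cluster."""
    return len(real & ideal) / len(real)


def comparative_recall(real, ideal):
    """r_ij: share of the ideal cluster that was recovered by the real cluster."""
    return len(real & ideal) / len(ideal)


# Toy partitions of five documents, identified by integers (hypothetical data).
ideal_partition = [{1, 2, 3}, {4, 5}]
real_partition = [{1, 2}, {3, 4, 5}]

# p[i][j] and r[i][j] compare ideal cluster i with real cluster j.
p = [[comparative_precision(rc, ic) for rc in real_partition] for ic in ideal_partition]
r = [[comparative_recall(rc, ic) for rc in real_partition] for ic in ideal_partition]
print(p)
print(r)
```

In this toy case the second real cluster contains every document of the second ideal cluster (recall 1) but also one extra document, which lowers its precision to 2/3.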
Total precision is measured as:

\[ P = \frac{\sum_{i=1}^{\sigma} \max_{j=1,\dots,\rho} p_{ij}}{\max(\rho, \sigma)} \qquad (4) \]

Total recall is measured as:

\[ R = \frac{\sum_{i=1}^{\sigma} \max_{j=1,\dots,\rho} r_{ij}}{\max(\rho, \sigma)} \qquad (5) \]

where σ is the number of ideal clusters and ρ is the number of real clusters. The maximum value of P and R is 1, whereas the minimum value depends on the distribution of the objects in the real partition.

3.3 F-Measure

The F-Measure [] is a standard metric that combines total precision and total recall into a single number representing the overall performance of the clustering method. It is equal to:

\[ F = \frac{(1 + \beta^2) \, P R}{\beta^2 P + R} \qquad (6) \]

The closer F is to 1, the better the classification. The parameter β establishes the importance of recall relative to precision: the importance of recall increases with the value of β. Typically performance is evaluated with different values of β; in our experiments we used four values for this parameter:
- β = 1.0 (P and R have the same importance);
- β = 0.5 (R is half as important as P);
- β = 2 (R has double the weight of P);
- β = ρ/… (the relative importance of R and P depends on the number of clusters obtained).

4. Experimental Results

In this section we show the results of an experiment carried out on a test corpus composed of 146 documents: articles extracted from the principal American newspapers (Boston Globe, Baltimore Sun, Chicago Tribune, Dallas Morning News, Herald Tribune, Los Angeles Times, New York Times, Washington Post, New York Post, USA Today) and publications on various themes (astronomy, electricity, economy, aerodynamics, etc.) published on internet sites. For this test set the ideal partition is formed by 20 clusters. The experiment was carried out with context analysis (AN-1) and without it (AN-2). Table 1 shows, for each experiment, the number of clusters obtained, the precision P, the recall R, and the F-measure for the different values of β.

Table 1. Experimental results
Columns: Test | N° of clusters | P | R | F-Measure (β=1) | F-Measure (β=0.5) | F-Measure (β=2) | F-Measure (β=ρ/…)
Rows: AN-1, AN-2
Values as extracted (cell boundaries not recoverable): 0,6 7 0,8 4 0,7 4 0,7 0,8 0,75 4 0,6 0,7 0,7 0,68 0,73 0,

As expected, the results are better when context analysis is performed.

5. Conclusion

In this work we have shown that knowledge extraction from the unstructured data contained in textual documents is possible with a clustering approach, and that implementing a web portal for the described KDD process makes it possible to deal with the information overload problem. The context analysis step and the classification step are realized with heuristics and can be redesigned to improve the performance of VCP; it is also possible to extend the text mining phase by integrating different techniques. The measures proposed in Section 3 are general and can be used to evaluate the performance of any clustering technique.

References

[1] Text Mining and the Knowledge Management Space Version, SEMIO Corporation, 1998, California.
[2] M. Lenz, Managing the Knowledge Contained in Technical Documents, Proc. of the Second Int. Conf. on Practical Aspects of Knowledge Management (PAKM98), Basel, Switzerland, 29-30 Oct.
[3] R. Feldman et al., Knowledge Management: a Text Mining Approach, Proc. of the Second Int. Conf. on Practical Aspects of Knowledge Management (PAKM98), Basel, Switzerland, 29-30 Oct.
[4] C. E. Shannon and W. Weaver, La Teoria Matematica delle Comunicazioni, Etas Kompass.
[5] B. Everitt, Cluster Analysis, Sage Publications Inc., Beverly Hills.
[6] J. A. Hartigan, Clustering Algorithms, John Wiley and Sons, USA.
[7] L. Kaufman, P. J. Rousseeuw, Finding Groups in Data, John Wiley and Sons, USA.
[8] Doerre, Gerstl, Seiffert, Text Mining: Finding Nuggets in Mountains of Textual Data, KDD99 Proceedings, ACM.
[9] L. Fahey, Competitors, John Wiley and Sons, USA.
[10] C. J. van Rijsbergen, W. B. Croft, Document Clustering: an Evaluation of Some Experiments with the Cranfield Collection, Information Processing and Management, 1975.
[11] A. Griffiths, H. C. Luckhurst, P. Willett, Using Interdocument Similarity Information in Document Retrieval Systems, Journal of the American Society for Information Science, 1986, vol. 37.
[12] B. S. Duran, P. L. Odell, Cluster Analysis: a Survey, Springer-Verlag, Berlin.
[13] W. J. Frawley, G. Piatetsky-Shapiro, C. Matheus, Knowledge Discovery in Databases: an Overview, AI Magazine, 1992.
[14] T. H. Davenport, L. Prusak, Working Knowledge, Harvard Business School Press, Boston.
[15] W. Eckerson, Analyst Insight Business Portal, June.
[16] P. D. Henig, Vertical Portals Aim for World Domination, Red Herring Online.
[17] D. Gilmore, Some Timely Guidelines for Web Design, Mercury News Technology.
[18] WordNet: An Electronic Lexical Database, MIT Press.
[19] M. Davidson, The Transformation of Management, Butterworth-Heinemann.
[20] I. Nonaka, A Dynamic Theory of Organizational Knowledge Creation, Organization Science, February 1994, Vol. 5, n. 1.
[21] J. Duncan Davidson, Java Servlet API Specification ver. 2.1, Public Review Draft, Sun Microsystems, October.
[22] E. Riloff and W. Lehnert, Information Extraction as a Basis for High-Precision Text Classification, ACM Transactions on Information Systems, July 1994, vol. 12, No. 3.
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
More informationEffective Data Retrieval Mechanism Using AML within the Web Based Join Framework
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted
More informationClustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
More informationSearch and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
More informationInteractive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
More informationHELSINKI UNIVERSITY OF TECHNOLOGY 26.1.2005 T-86.141 Enterprise Systems Integration, 2001. Data warehousing and Data mining: an Introduction
HELSINKI UNIVERSITY OF TECHNOLOGY 26.1.2005 T-86.141 Enterprise Systems Integration, 2001. Data warehousing and Data mining: an Introduction Federico Facca, Alessandro Gallo, federico@grafedi.it sciack@virgilio.it
More informationData Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
More informationUniversal. Event. Product. Computer. 1 warehouse.
Dynamic multi-dimensional models for text warehouses Maria Zamr Bleyberg, Karthik Ganesh Computing and Information Sciences Department Kansas State University, Manhattan, KS, 66506 Abstract In this paper,
More informationData Mining in Web Search Engine Optimization and User Assisted Rank Results
Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationF. Aiolli - Sistemi Informativi 2007/2008
Text Categorization Text categorization (TC - aka text classification) is the task of buiding text classifiers, i.e. sofware systems that classify documents from a domain D into a given, fixed set C =
More informationSPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
More informationProjektgruppe. Categorization of text documents via classification
Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction
More informationLoad Balancing in Structured Peer to Peer Systems
Load Balancing in Structured Peer to Peer Systems DR.K.P.KALIYAMURTHIE 1, D.PARAMESWARI 2 Professor and Head, Dept. of IT, Bharath University, Chennai-600 073 1 Asst. Prof. (SG), Dept. of Computer Applications,
More informationLoad Balancing in Structured Peer to Peer Systems
Load Balancing in Structured Peer to Peer Systems Dr.K.P.Kaliyamurthie 1, D.Parameswari 2 1.Professor and Head, Dept. of IT, Bharath University, Chennai-600 073. 2.Asst. Prof.(SG), Dept. of Computer Applications,
More informationBusiness Intelligence and Decision Support Systems
Chapter 12 Business Intelligence and Decision Support Systems Information Technology For Management 7 th Edition Turban & Volonino Based on lecture slides by L. Beaubien, Providence College John Wiley
More informationWeb Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it
Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content
More informationC o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER
INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process
More informationSustaining Privacy Protection in Personalized Web Search with Temporal Behavior
Sustaining Privacy Protection in Personalized Web Search with Temporal Behavior N.Jagatheshwaran 1 R.Menaka 2 1 Final B.Tech (IT), jagatheshwaran.n@gmail.com, Velalar College of Engineering and Technology,
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationMIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
More informationBuilding A Smart Academic Advising System Using Association Rule Mining
Building A Smart Academic Advising System Using Association Rule Mining Raed Shatnawi +962795285056 raedamin@just.edu.jo Qutaibah Althebyan +962796536277 qaalthebyan@just.edu.jo Baraq Ghalib & Mohammed
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
More informationα α λ α = = λ λ α ψ = = α α α λ λ ψ α = + β = > θ θ β > β β θ θ θ β θ β γ θ β = γ θ > β > γ θ β γ = θ β = θ β = θ β = β θ = β β θ = = = β β θ = + α α α α α = = λ λ λ λ λ λ λ = λ λ α α α α λ ψ + α =
More informationFacilitating Business Process Discovery using Email Analysis
Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process
More informationCLUSTERING FOR FORENSIC ANALYSIS
IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 129-136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS
More informationBig Data Text Mining and Visualization. Anton Heijs
Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark
More informationA Stock Pattern Recognition Algorithm Based on Neural Networks
A Stock Pattern Recognition Algorithm Based on Neural Networks Xinyu Guo guoxinyu@icst.pku.edu.cn Xun Liang liangxun@icst.pku.edu.cn Xiang Li lixiang@icst.pku.edu.cn Abstract pattern respectively. Recent
More informationA Business Process Services Portal
A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru
More informationA FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer
More informationHow To Use Data Mining For Knowledge Management In Technology Enhanced Learning
Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning
More informationDetermining optimal window size for texture feature extraction methods
IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec
More information131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
More informationDynamic Data in terms of Data Mining Streams
International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining
More informationCloud Storage-based Intelligent Document Archiving for the Management of Big Data
Cloud Storage-based Intelligent Document Archiving for the Management of Big Data Keedong Yoo Dept. of Management Information Systems Dankook University Cheonan, Republic of Korea Abstract : The cloud
More informationdm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
More informationKNOWLEDGE GRID An Architecture for Distributed Knowledge Discovery
KNOWLEDGE GRID An Architecture for Distributed Knowledge Discovery Mario Cannataro 1 and Domenico Talia 2 1 ICAR-CNR 2 DEIS Via P. Bucci, Cubo 41-C University of Calabria 87036 Rende (CS) Via P. Bucci,
More informationData Pre-Processing in Spam Detection
IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 11 May 2015 ISSN (online): 2349-784X Data Pre-Processing in Spam Detection Anjali Sharma Dr. Manisha Manisha Dr. Rekha Jain
More informationModelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
More informationCross-lingual Synonymy Overlap
Cross-lingual Synonymy Overlap Anca Dinu 1, Liviu P. Dinu 2, Ana Sabina Uban 2 1 Faculty of Foreign Languages and Literatures, University of Bucharest 2 Faculty of Mathematics and Computer Science, University
More informationThe Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
More informationDesigning an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract)
Designing an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract) Johann Eder 1, Heinz Frank 1, Tadeusz Morzy 2, Robert Wrembel 2, Maciej Zakrzewicz 2 1 Institut für Informatik
More informationHealthcare Measurement Analysis Using Data mining Techniques
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik
More informationChapter ML:XI. XI. Cluster Analysis
Chapter ML:XI XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster
More informationAn ontology-based approach for semantic ranking of the web search engines results
An ontology-based approach for semantic ranking of the web search engines results Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name
More informationBig Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
More informationInternet of Things, data management for healthcare applications. Ontology and automatic classifications
Internet of Things, data management for healthcare applications. Ontology and automatic classifications Inge.Krogstad@nor.sas.com SAS Institute Norway Different challenges same opportunities! Data capture
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 285-292, ISSN 1896-7094, 2007 PIPS. Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy fabio.tamburini@unibo.it Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Data mining and complex telecommunications problems modeling
Paper Data mining and complex telecommunications problems modeling Janusz Granat Abstract The telecommunications operators have to manage one of the most complex systems developed by human beings. Moreover,
Twitter Analytics: Architecture, Tools and Analysis
Twitter Analytics: Architecture, Tools and Analysis Rohan D.W Perera CERDEC Ft Monmouth, NJ 07703-5113 S. Anand, K. P. Subbalakshmi and R. Chandramouli Department of ECE, Stevens Institute Of Technology
Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin
Error Log Processing for Accurate Failure Prediction Felix Salfner ICSI Berkeley Steffen Tschirpke Humboldt-Universität zu Berlin Introduction Context of work: Error-based online failure prediction: error
An Information Retrieval using weighted Index Terms in Natural Language document collections
Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia
Mining Signatures in Healthcare Data Based on Event Sequences and its Applications
Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1
Supply chain management by means of FLM-rules
Supply chain management by means of FLM-rules Nicolas Le Normand, Julien Boissière, Nicolas Méger, Lionel Valet LISTIC Laboratory - Polytech Savoie Université de Savoie B.P. 80439 F-74944 Annecy-Le-Vieux,
Deposit Identification Utility and Visualization Tool
Deposit Identification Utility and Visualization Tool Colorado School of Mines Field Session Summer 2014 David Alexander Jeremy Kerr Luke McPherson Introduction Newmont Mining Corporation was founded in
Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining
Lluis Belanche + Alfredo Vellido Intelligent Data Analysis and Data Mining a.k.a. Data Mining II Office 319, Omega, BCN EET, office 107, TR 2, Terrassa avellido@lsi.upc.edu skype, gtalk: avellido Tels.:
Data Mining: A Preprocessing Engine
Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,
A Framework for Intelligent Online Customer Service System
A Framework for Intelligent Online Customer Service System Yiping WANG Yongjin ZHANG School of Business Administration, Xi an University of Technology Abstract: In a traditional customer service support
Stock Market Prediction Using Data Mining
Stock Market Prediction Using Data Mining 1 Ruchi Desai, 2 Prof.Snehal Gandhi 1 M.E., 2 M.Tech. 1 Computer Department 1 Sarvajanik College of Engineering and Technology, Surat, Gujarat, India Abstract
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2
Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,
NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE
www.arpapress.com/volumes/vol13issue3/ijrras_13_3_18.pdf NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE Hebah H. O. Nasereddin Middle East University, P.O. Box: 144378, Code 11814, Amman-Jordan
jeti: A Tool for Remote Tool Integration
jeti: A Tool for Remote Tool Integration Tiziana Margaria 1, Ralf Nagel 2, and Bernhard Steffen 2 1 Service Engineering for Distributed Systems, Institute for Informatics, University of Göttingen, Germany
Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
Customer Intentions Analysis of Twitter Based on Semantic Patterns
Customer Intentions Analysis of Twitter Based on Semantic Patterns Mohamed Hamroun mohamed.hamrounn@gmail.com Mohamed Salah Gouider ms.gouider@yahoo.fr Lamjed Ben Said lamjed.bensaid@isg.rnu.tn ABSTRACT
Course Design Document. IS414: Search Engine Technologies
Course Design Document IS414: Search Engine Technologies Version 2.7 6 June 2011 IS414 Search Engine Technologies Page 1 Table of Contents 1. Revision History... 3 2. Overview of the Search Engine Technologies
A Survey of Text Mining Techniques and Applications
60 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 1, NO. 1, AUGUST 2009 A Survey of Text Mining Techniques and Applications Vishal Gupta Lecturer Computer Science & Engineering, University
Mining Text Data: An Introduction
Chapter 10. Text and Web Mining http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo
EXPLOITING FOLKSONOMIES AND ONTOLOGIES IN AN E-BUSINESS APPLICATION
EXPLOITING FOLKSONOMIES AND ONTOLOGIES IN AN E-BUSINESS APPLICATION Anna Goy and Diego Magro Dipartimento di Informatica, Università di Torino C. Svizzera, 185, I-10149 Italy ABSTRACT This paper proposes
Word Taxonomy for On-line Visual Asset Management and Mining
Word Taxonomy for On-line Visual Asset Management and Mining Osmar R. Zaïane * Eli Hagen ** Jiawei Han ** * Department of Computing Science, University of Alberta, Canada, zaiane@cs.uaberta.ca ** School
Social Media Mining. Data Mining Essentials
Introduction Data production rate has increased dramatically (Big Data) and we are able to store much more data than before, e.g., purchase data, social media data, mobile phone data Businesses and customers
Importance or the Role of Data Warehousing and Data Mining in Business Applications
Journal of The International Association of Advanced Technology and Science Importance or the Role of Data Warehousing and Data Mining in Business Applications ATUL ARORA ANKIT MALIK Abstract Information
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
Automating Legal Research through Data Mining
Automating Legal Research through Data Mining M.F.M Firdhous, Faculty of Information Technology, University of Moratuwa, Moratuwa, Sri Lanka. Mohamed.Firdhous@uom.lk Abstract The term legal research generally
Standardization of Components, Products and Processes with Data Mining
B. Agard and A. Kusiak, Standardization of Components, Products and Processes with Data Mining, International Conference on Production Research Americas 2004, Santiago, Chile, August 1-4, 2004. Standardization
Understanding Web personalization with Web Usage Mining and its Application: Recommender System
Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,
Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs
Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs Ryosuke Tsuchiya 1, Hironori Washizaki 1, Yoshiaki Fukazawa 1, Keishi Oshima 2, and Ryota Mibe
Personalization of Web Search With Protected Privacy
Personalization of Web Search With Protected Privacy S.S DIVYA, R. RUBINI, P. EZHIL Final year, Information Technology, KarpagaVinayaga College Engineering and Technology, Kanchipuram [D.t] Final year, Information
Standardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
PartJoin: An Efficient Storage and Query Execution for Data Warehouses
PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE ladjel@imerir.com 2
Knowledge Management
Knowledge Management Management Information Code: 164292-02 Course: Management Information Period: Autumn 2013 Professor: Sync Sangwon Lee, Ph. D D. of Information & Electronic Commerce 1 00. Contents
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Scalable Parallel Clustering for Data Mining on Multicomputers
Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti and D. Talia ISI-CNR c/o DEIS, UNICAL 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it Abstract. This
Discretization and grouping: preprocessing steps for Data Mining
Discretization and grouping: preprocessing steps for Data Mining Petr Berka 1 and Ivan Bruha 2 1 Laboratory of Intelligent Systems Prague University of Economic W. Churchill Sq. 4, Prague CZ 13067, Czech Republic
Role of Text Mining in Business Intelligence
Role of Text Mining in Business Intelligence Palak Gupta 1, Barkha Narang 2 Abstract This paper includes the combined study of business intelligence and text mining of uncertain data. The data that is
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Online Failure Prediction in Cloud Datacenters
Online Failure Prediction in Cloud Datacenters Yukihiro Watanabe Yasuhide Matsumoto Once failures occur in a cloud datacenter accommodating a large number of virtual resources, they tend to spread rapidly
Blog Post Extraction Using Title Finding
Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Less naive Bayes spam detection
Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems
Implementing Heuristic Miner for Different Types of Event Logs
Implementing Heuristic Miner for Different Types of Event Logs Angelina Prima Kurniati 1, Guntur Prabawa Kusuma 2, Gede Agung Ary Wisudiawan 3 1,3 School of Computing, Telkom University, Indonesia. 2 School
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 Over viewing issues of data mining with highlights of data warehousing Rushabh H. Baldaniya, Prof H.J.Baldaniya,
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data
An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,