Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining

S. Iiritano
Getronics S.p.A., Rende (CS), Italy
s_iiritano@yahoo.it

M. Ruffolo
Intersiel S.p.A., Via G. Rossini, Rende (CS), Italy
m.ruffolo@telcal.it

Abstract

The huge amount of unstructured data available on the Web and on intranets creates an information overload problem, so managing the knowledge contained in textual documents is an important problem of Knowledge Management. Knowledge can be extracted from collections of data through Knowledge Discovery in Databases (KDD), an interactive and iterative process focused on exploring data to discover new and interesting patterns in them. The fundamental phase of the KDD process is Data Mining when the data are in structured form, and Text Mining when they are unstructured. This paper describes a prototype of a vertical corporate portal that implements a KDD process for extracting knowledge from the unstructured data contained in textual documents. Text mining is realized through a clustering method that partitions a set of documents on the basis of their contents, characterized through the frequency of their words.

1. Introduction

Using Knowledge Discovery in Databases (KDD), whose fundamental step is Data Mining, knowledge workers can obtain strategic information for their business. KDD has deeply transformed the way traditional databases, where data are in structured form, are interrogated, by automatically finding new and previously unknown patterns in huge quantities of data. However, structured data represent only a small part of an organization's overall knowledge; most of this knowledge is incorporated in textual documents. The amount of unstructured information in this form, accessible through the web, intranets, news groups, etc., has increased enormously in recent years.
In this scenario, the development of Knowledge Extraction techniques and tools able to manage the knowledge contained in electronic textual documents is a necessary task. This is possible through a KDD process based on Text Mining. One particular Text Mining approach is based on clustering techniques, which group documents according to their content; in this case the knowledge extraction process consists in recognizing these groups.

In this paper we present, in Section 2, a prototype of a vertical corporate portal that implements a KDD process as described above, in which documents are grouped together on the basis of word frequencies. Section 4 contains the results of two experiments carried out on a test corpus composed of articles extracted from some American newspapers and web publications, evaluated using the measures defined in Section 3.

2. A Vertical Corporate Portal for Clustering Textual Documents

Clustering methods are techniques for partitioning a set of objects into non-overlapping groups (clusters) on the basis of suitable similarity measures. Numerous clustering algorithms can be found in the literature [6], as well as a wide variety of similarity coefficients [7]. These techniques can be used in a KDD process to extract the knowledge contained in (textual) documents, as shown in Fig. 1.

This work is supported by Laboratorio per l'Innovazione dell'Azione Progettuale Ricerca del Piano Telematico Calabria.

[Figure 1. The KDD process for unstructured data. Diagram labels: Document Sources, Document Acquisition, Repository, Document Pre-Processing, Document Corpus, Structured Documents, Text Mining, Clustering, Results, Result Interpretation and Refinement, Knowledge Extraction.]

The KDD process is composed of four phases:
Document Acquisition Phase: a collection of documents coming from various sources (Internet, company intranet, etc.) is stored in a repository;
Document Pre-processing: documents are submitted to a linguistic pre-processing based on term filtering and context analysis, then an internal representation based on word frequencies is produced;
Text Mining: documents are partitioned into clusters;
Results Interpretation and Refinement: clusters are submitted to a human operator for interpretation and refinement.

To implement this KDD process we developed a prototype of a Vertical Corporate Portal (VCP), composed of six modules that interact as shown in Fig. 2.

[Figure 2. Architecture of the VCP. Diagram labels: Web, Intranet, Crawler, Document Collector, Repository Manager, Repository, Corpus Selector, Corpus, Text Classifier, Corpus Partition, Partition Analyzer; phases: Acquisition, Pre-processing, Text Mining, Results Interpretation and Refinement.]

The Crawler and the Document Collector carry out the document acquisition phase; the Repository Manager (RM) performs the document pre-processing phase; the Corpus Selector (CS) and the Text Classifier (TC) realize the text mining phase; the Partition Analyser (PA) makes it possible to interpret and refine the results of the document clustering.

2.1 Document Acquisition

VCP is designed to acquire documents from inside or outside the organization. Customer comments and communications, news groups, manuals and program documentation, trade publications, internal research reports and know-how documents residing on intranets can be submitted to the repository through the Document Collector, which points to an intranet site or directory to find them. Web sites containing information on competitors, markets, products, technologies, etc. can be acquired with the Crawler, an automatic agent that periodically explores selected web sites.

2.2 Document Pre-Processing

The aim of this phase is to produce, for each input document, an internal representation suitable for the text mining phase. The input is a set of m documents, and the output is a set of structured documents, one for each input document.

A document is a sequence of n words (n is the length of the document) and its vocabulary is the set of words $V = \{w_1, \dots, w_\lambda\}$ occurring in it; the structured version of the document, denoted D, is the set of pairs $(w_j, f_D(w_j))$ where, for each $j = 1, \dots, \lambda$, $f_D(w_j)$ represents the (relative) frequency of the word $w_j \in V$ in the document (i.e. the number of times that $w_j$ occurs in the document, normalised w.r.t. the document length).

In the VCP architecture the pre-processing phase is carried out by RM in three steps, as shown in Fig. 3.

[Figure 3. Document pre-processing phase. Diagram labels: Document, Filtering, Context Analysis, Structuring, Structured Document.]

In particular:
Filtering: discards from each document the additional words (articles, subjects, pronouns, prepositions) that are not interesting for the analysis; the remaining words are then reduced to their stems by removing suffixes, prefixes, verb conjugations and plurals. The result of this step is a Filtered document;
Normalization: given a filtered document, a context analysis is performed by which a synonym is assigned to each word. The result of this step is a Normalized document.
Context analysis is performed, only for English documents, using WordNet [18], developed at Princeton University.

WordNet is a lexical database that is able to identify all meanings (senses) of a given English noun, adjective, verb or adverb, finding its polysemies, antonyms and synonyms. The synonym attributed to a term is the closest sense obtained by considering the words around it (its context). However, to also allow the treatment of documents written in other languages, VCP is implemented so that context analysis can be excluded;
Structuring: given a normalised document, a Structured document is produced.

At the end of this step RM updates two index files: Doc-Index, whose structure is (Cod_Doc, Reference, Number of words), and Word-Index, whose structure is (Word, Cod_Doc, Synonym, Synonym Frequency).

2.3 Text Mining

This phase is carried out in two steps, Corpus Selection and Clustering.

Corpus Selection. We call Corpus a set of structured documents. It can be formed by all the documents in the repository or by a subset of them, selected through a query. This step is performed by the Corpus Selector using information retrieval techniques on the files produced by RM.

Clustering. The input to this phase is a corpus $\Omega$ and the output is a partition $P = \{\Gamma_1, \dots, \Gamma_k\}$ of $\Omega$, where each $\Gamma_i$ is called a cluster. A cluster is a set of similar documents, and so it can be seen as the sequence of h words (h is the cluster length) occurring in all the documents it contains. As for documents, the cluster vocabulary $V_\Gamma = \{w_1, \dots, w_\eta\}$ is the set of words occurring in $\Gamma$, and the structured version C is the set of pairs $(w_j, f_C(w_j))$ where, for each $j = 1, \dots, \eta$, $f_C(w_j)$ represents the (relative) frequency of the word $w_j \in V_\Gamma$ in the cluster $\Gamma$ (i.e. the number of times $w_j$ occurs in $\Gamma$, normalised w.r.t. the cluster length). P is obtained through a clustering technique in which the similarity coefficient is evaluated on the basis of the structured representation of clusters and documents, considering the frequency of the words included both in the document vocabulary $V$ and in the cluster vocabulary $V_\Gamma$.
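The structured representation used for both documents and clusters can be illustrated with a short Python sketch. It is not the Repository Manager's code: the stop-word list is a tiny hypothetical stand-in for the Filtering step, stemming and context analysis are omitted, and the function names are chosen here only for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list standing in for the Filtering step.
STOP_WORDS = {"the", "a", "of", "and", "in"}

def filter_words(text):
    """Lower-case the text, split on whitespace and drop the stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def structure(words):
    """Map each word to its relative frequency (count / sequence length)."""
    if not words:
        return {}
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def structure_cluster(documents):
    """A cluster is treated as the concatenation of its documents' words."""
    return structure([w for doc in documents for w in doc])

doc_1 = filter_words("The price of the stock and the market price")
doc_2 = filter_words("Stock trade in the market")

D = structure(doc_1)                      # structured document
C = structure_cluster([doc_1, doc_2])     # structured cluster holding both documents
print(D)
print(C)
```

In this sketch the frequencies of a structured document or cluster always sum to 1, which is the property the similarity measure relies on.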
In the following we describe in detail the similarity measure and the clustering algorithm implemented in VCP.

2.3.1 Similarity Measure.

Let $W = V \cap V_\Gamma = \{w_1, \dots, w_\mu\}$ be the set of words present both in the document vocabulary $V$ and in the cluster vocabulary $V_\Gamma$. The similarity between the document D and the cluster C is measured as:

$$S = \Phi\,(1 - \Theta) \qquad (1)$$

where

$$\Phi = \frac{\sum_{k=1}^{\mu} f_D(w_k) + \sum_{k=1}^{\mu} f_C(w_k)}{2}$$

measures the degree of overlapping of the document vocabulary and the cluster vocabulary, and

$$\Theta = \frac{\sum_{k=1}^{\mu} \left| f_D(w_k) - f_C(w_k) \right|}{\mu}$$

measures the dissimilarity between the common part of the document vocabulary and of the cluster vocabulary. Note that $S \in [0, 1]$. In fact:
if $V = V_\Gamma$ and $f_D(w_k) = f_C(w_k)$ for every $w_k$ ($k = 1, \dots, \mu$), then $\Phi = 1$, $\Theta = 0$ and $S = 1$;
if $V \cap V_\Gamma = \emptyset$, then $\Phi = 0$ and $S = 0$.

2.3.2 Clustering Algorithm.

The clustering method is illustrated through the following algorithm, written in C-like code and based on the concepts defined above.
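Before the listing, it may help to see the procedure in runnable form. The Python sketch below is not the VCP implementation; it is one possible reading of Section 2.3 under stated assumptions: Φ is taken as half the sum of the shared-word frequencies in the document and in the cluster, Θ as the mean absolute frequency difference over the shared words, documents are given as word-count tables, and a document is re-checked against a newly created cluster as soon as that cluster appears. All function names, the toy corpus and the threshold 0.2 are hypothetical.

```python
from collections import Counter

def rel_freq(counts):
    """Turn raw word counts into relative frequencies (the structured form)."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def similarity(D, C):
    """S = phi * (1 - theta), computed over the words shared by D and C."""
    common = D.keys() & C.keys()
    if not common:
        return 0.0  # disjoint vocabularies: phi = 0, hence S = 0
    phi = (sum(D[w] for w in common) + sum(C[w] for w in common)) / 2
    theta = sum(abs(D[w] - C[w]) for w in common) / len(common)
    return phi * (1 - theta)

def profile(docs, members):
    """Structured form of a cluster: frequencies over all its documents' words."""
    counts = Counter()
    for i in members:
        counts.update(docs[i])
    return rel_freq(counts)

def recontrol(docs, partition):
    """After a new cluster is created, move documents that fit it better."""
    new = partition[-1]
    for members in partition[:-1]:
        for i in list(members):
            d = rel_freq(docs[i])
            better = similarity(d, profile(docs, new)) > similarity(d, profile(docs, members))
            # Guard added in this sketch (an assumption): never empty a cluster.
            if len(members) > 1 and better:
                members.remove(i)
                new.append(i)

def cluster(docs, alpha):
    """Partition docs (a list of non-empty Counters) into lists of document indices."""
    if not docs:
        return []
    partition = [[0]]  # the first document seeds the first cluster
    for i in range(1, len(docs)):
        d = rel_freq(docs[i])
        sims = [similarity(d, profile(docs, m)) for m in partition]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < alpha:
            partition.append([i])      # no cluster is similar enough
            recontrol(docs, partition)
        else:
            partition[best].append(i)  # join the most similar cluster
    return partition

docs = [Counter(t.split()) for t in (
    "market price market stock",
    "price stock stock trade",
    "star planet orbit star",
    "planet orbit telescope",
)]
result = cluster(docs, alpha=0.2)  # 0.2 is an arbitrary illustrative threshold
print(result)
```

On this toy corpus the two finance-like texts and the two astronomy-like texts end up in separate clusters. The authors' own C-like formulation is given next.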

Input: a structured corpus Ω
Output: a partition P of Ω

Initialization:
    extract a document D from Ω;
    create a new cluster C_1 containing document D;
    P = {C_1};

Iteration:
    while (Ω ≠ ∅) do {
        extract document D_i from Ω;
        extract cluster C_1 from P;
        maxsimilarity = Calculate_Similarity(D_i, C_1);
        // CL is a temporary cluster list used during work
        CL = P - {C_1};
        while (CL ≠ ∅) do {
            extract cluster C_j from cluster list CL;
            maxsimilarity = max(maxsimilarity, Calculate_Similarity(D_i, C_j));
        }
        if (maxsimilarity < α) {
            create a new cluster C that contains document D_i;
            P = P ∪ {C};
            Re_Control_Clusters();
        } else {
            j = index of the cluster for which maxsimilarity = Calculate_Similarity(D_i, C_j);
            C_j = C_j ∪ {D_i};
        }
    }

The algorithm uses the following functions:
Re_Control_Clusters(). It is performed only when a new cluster is created. In this case all documents already assigned to the other clusters are re-examined, and for each of them the similarity to the newly created cluster is determined. If it is greater than the similarity to the cluster to which the document was assigned, the document is moved to the new cluster;
Calculate_Similarity(D_i, C_j). It receives as input a structured document D_i and a structured cluster C_j and determines their similarity.

The threshold value α was determined experimentally.

2.4 Result Interpretation and Refinement

This phase is realized by PA, which shows for each cluster: the cluster length; the number of documents it contains; a list of hyperlinks to these documents; a list of the most representative words in the cluster, ordered by frequency. PA moreover guides the user in exploring the clusters and the documents they contain more deeply, in order to interpret and refine the text mining results.

3. Measures of Performance: Precision, Coverage, F-Measure

The performance of a clustering method is evaluated by comparing the obtained (real) partition with the ideal one, built manually by a human operator who splits the documents into clusters on the basis of the homogeneity of their contents. In this Section we present some general formulas for evaluating the performance of any clustering method. We consider measures referred to single clusters (comparative precision and comparative recall) and measures referred to the whole partition (total precision, total recall, F-Measure).

3.1 Comparative Precision and Comparative Recall

Let $P = \{\Gamma_i\}$ be an ideal partition and $P' = \{\Gamma'_1, \dots, \Gamma'_\rho\}$ be a real partition of a corpus $\Omega$. Using the symbol $|\cdot|$ to denote the cardinality of a set, we can measure the comparative precision of the real cluster $\Gamma'_j$ w.r.t. the ideal cluster $\Gamma_i$ as:

$$p_{ij} = \frac{|\Gamma'_j \cap \Gamma_i|}{|\Gamma'_j|} \qquad (2)$$

The comparative recall of the real cluster $\Gamma'_j$ w.r.t. the ideal cluster $\Gamma_i$ is evaluated as:

$$r_{ij} = \frac{|\Gamma'_j \cap \Gamma_i|}{|\Gamma_i|} \qquad (3)$$

Comparative precision and comparative recall are used to evaluate total precision and total recall for the whole partition, comparing all real clusters with all ideal ones.

3.2 Total Precision and Total Recall

With total precision and total recall we can evaluate the goodness of the obtained partition by analysing the composition of the real clusters closest to the ideal ones.
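The two comparative measures of Section 3.1 reduce to set operations. The following Python sketch assumes that clusters are represented as sets of document identifiers, with precision taken as the shared documents over the size of the real cluster and recall as the shared documents over the size of the ideal cluster; the function names and the sample identifiers are hypothetical.

```python
def comparative_precision(real, ideal):
    """Share of the real cluster's documents that belong to the ideal cluster."""
    return len(real & ideal) / len(real)

def comparative_recall(real, ideal):
    """Share of the ideal cluster's documents recovered by the real cluster."""
    return len(real & ideal) / len(ideal)

ideal_cluster = {"d1", "d2", "d3", "d4"}
real_cluster = {"d3", "d4", "d5"}

p = comparative_precision(real_cluster, ideal_cluster)  # 2 shared out of 3 real documents
r = comparative_recall(real_cluster, ideal_cluster)     # 2 shared out of 4 ideal documents
print(p, r)
```

Here the real cluster shares two documents with the ideal one, so its precision is two thirds and its recall is one half.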

Total precision P (4) and total recall R (5) are obtained by combining, over the whole partition, the maximum values of the comparative precision $p_{ij}$ and of the comparative recall $r_{ij}$, respectively. The maximum value of P and R is 1, whereas the minimum value depends on the distribution of the objects in the real partition.

3.3 F-Measure

The F-Measure [] is a standard metric that combines total precision and total recall into a single number representing the overall performance of the clustering method. It is equal to:

$$F = \frac{(1 + \beta^2)\,P\,R}{\beta^2 P + R} \qquad (6)$$

The closer F is to 1, the better the classification. The value of the parameter β establishes the relative importance of recall with respect to precision: the importance of recall grows with β. Typically, performance is evaluated with different values of β; in our experiments we used four values for this parameter: β = 1.0 (P and R have the same importance); β = 0.5 (R is half as important as P); β = 2 (R has twice the weight of P); and a value of β defined from ρ (the relative importance of R and P depends on the number of obtained clusters).

4. Experimental Results

In this section we show the results of an experiment carried out on a test corpus composed of 146 documents: articles extracted from the principal American newspapers (Boston Globe, Baltimore Sun, Chicago Tribune, Dallas Morning News, Herald Tribune, Los Angeles Times, New York Times, Washington Post, New York Post, USA Today) and publications on various themes (astronomy, electricity, economy, aerodynamics, etc.) published on internet sites. For this test set the ideal partition is formed by 0 clusters. The experiments were carried out with context analysis (AN-1) and without it (AN-2). Table 1 shows, for each experiment, the number of obtained clusters, the precision P, the recall R and the F-Measure for the different values of β.

[Table 1. Experimental results. Rows: AN-1, AN-2. Columns: N. of clusters, P, R, F-Measure for β=1, β=0.5, β=2, β=ρ/. Cell values as extracted: 0,6 7 0,8 4 0,7 4 0,7 0,8 0,75 4 0,6 0,7 0,7 0,68 0,73 0,]

As expected, better results are obtained when the context analysis is performed.

5. Conclusion

In this work we have shown that knowledge extraction from the unstructured data contained in textual documents is possible with a clustering approach, and that implementing a web portal for the described KDD process makes it possible to deal with the information overload problem. The context analysis step and the classification step are realized with heuristics and can be re-designed to improve the performance of VCP; it is also possible to extend the text mining phase by integrating different techniques. The measures proposed in Section 3 are general and can be used to evaluate the performance of any clustering technique.

References

[1] Text Mining and the Knowledge Management Space Version, SEMIO Corporation, 1998, California.
[2] M. Lenz, Managing the Knowledge Contained in Technical Documents, Proc. of the Second Int. Conf. on Practical Aspects of Knowledge Management (PAKM98), Basel, Switzerland, 29-30 Oct.

[3] R. Feldman et al., Knowledge Management: a Text Mining Approach, Proc. of the Second Int. Conf. on Practical Aspects of Knowledge Management (PAKM98), Basel, Switzerland, 29-30 Oct.
[4] C. E. Shannon and W. Weaver, La Teoria Matematica delle Comunicazioni, Etas Kompass.
[5] B. Everitt, Cluster Analysis, Sage Publication Inc., Beverly Hills.
[6] J. A. Hartigan, Clustering Algorithms, John Wiley and Sons, USA.
[7] L. Kaufman, P. J. Rousseeuw, Finding Groups in Data, John Wiley and Sons, USA.
[8] Doerre, Gerstl, Seiffert, Text Mining: Finding Nuggets in Mountains of Textual Data, KDD99 proceedings, ACM.
[9] L. Fahey, Competitors, John Wiley and Sons, USA.
[10] C. J. Van Rijsbergen, W. B. Croft, Documents Clustering: an Evaluation of Some Experiments with the Cranfield Collection, Information Processing and Management, 1975.
[11] A. Griffiths, H. C. Luckhurst, P. Willett, Using Interdocument Similarity Information in Document Retrieval Systems, Journal of the American Society for Information Science, 1986, vol. 37.
[12] B. S. Duran, P. L. Odell, Cluster Analysis: a Survey, Springer-Verlag, Berlin.
[13] W. J. Frawley, G. Piatetsky-Shapiro, C. Matheus, Knowledge Discovery in Databases: an Overview, AI Magazine, 1992.
[14] T. H. Davenport, L. Prusak, Working Knowledge, Harvard Business School Press, Boston.
[15] W. Eckerson, Analyst Insight Business Portal, June.
[16] P. D. Henig, Vertical Portals Aim for World Domination, Red Herring Online.
[17] D. Gilmore, Some Timely Guidelines for Web Design, Mercury News Technology.
[18] WordNet: An Electronic Lexical Database, MIT Press.
[19] M. Davidson, The Transformation of Management, Butterworth-Heinemann.
[20] I. Nonaka, A Dynamic Theory of Organizational Knowledge Creation, Organization Science, February 1994, Vol. 5 n. 1.
[21] J. Duncan Davidson, Java Servlet API Specification ver. 2.1, Public Review Draft, Sun Microsystems, October.
[22] E. Riloff and W. Lehnert, Information Extraction as a Basis for High-Precision Text Classification, ACM Transactions on Information Systems, July 1994, vol. 12, No. 3.
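As a supplementary illustration of the partition-level measures of Section 3, the following Python sketch computes the F-Measure as (1 + β²)PR / (β²P + R). The totals are computed under an assumption made only for this sketch: each real cluster is scored against its best-matching ideal cluster and the scores are averaged over the real clusters. The paper's own normalisation may differ, and the function names and sample partitions are hypothetical.

```python
def total_scores(ideal, real):
    """Assumed totals: mean, over real clusters, of the best comparative value."""
    best_p = [max(len(r & i) / len(r) for i in ideal) for r in real]
    best_r = [max(len(r & i) / len(i) for i in ideal) for r in real]
    return sum(best_p) / len(real), sum(best_r) / len(real)

def f_measure(P, R, beta=1.0):
    """F = (1 + beta^2) * P * R / (beta^2 * P + R); defined as 0 when P = R = 0."""
    if P == 0 and R == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * P * R / (b2 * P + R)

ideal = [{1, 2, 3}, {4, 5, 6}]
real = [{1, 2}, {3, 4, 5, 6}]

P, R = total_scores(ideal, real)
for beta in (1.0, 0.5, 2.0):
    print(beta, round(f_measure(P, R, beta), 3))
```

With β below 1 the score moves towards P, and with β above 1 it moves towards R, which matches the role the paper assigns to β.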