Journal of Zhejiang University SCIENCE  ISSN 1009-3095  http://www.zju.edu.cn/jzus  E-mail: jzus@zju.edu.cn

An improved TF-IDF approach for text classification*

ZHANG Yun-tao (张云涛)^1,2, GONG Ling (龚玲)^2, WANG Yong-cheng (王永成)^2
(^1 Network & Information Center, ^2 School of Electronic & Information Technology, Shanghai Jiaotong University, Shanghai 200030, China)
E-mail: ytzhang@mail.sjtu.edu.cn; lgong@mail.sjtu.edu.cn; ycwang@mail.sjtu.edu.cn
Received Dec. 5, 2003; revision accepted June 26, 2004

* Project (No. 60082003) supported by the National Natural Science Foundation of China

Abstract: This paper presents a new improved term frequency/inverse document frequency (TF-IDF) approach, which uses confidence, support and characteristic words to enhance the recall and precision of text classification. Synonyms defined by a lexicon are processed in the improved TF-IDF approach. We discuss and analyze in detail the relationship among confidence, recall and precision. Experiments based on science and technology documents gave promising results: the new TF-IDF approach improves the precision and recall of text classification compared with the conventional TF-IDF approach.

Key words: Term frequency/inverse document frequency (TF-IDF), Text classification, Confidence, Support, Characteristic words
doi:10.1631/jzus.2005.A0049    Document code: A    CLC number: TP31

INTRODUCTION

The widespread and increasing availability of text documents in electronic form increases the importance of using automatic methods to analyze their content: relying on domain experts to identify new text documents and allocate them to well-defined categories is time-consuming and expensive, has limits, and does not provide a continuous measure of the degree of confidence with which the allocation was made (Olivier, 2000). As a result, the identification and classification of text documents based on their contents are becoming imperative. The classification can be done automatically by classifiers learning from training samples of text documents. The main aim of a classifier is to obtain a set of characteristics that remain relatively constant for separate categories of text documents, and to classify the huge number of text documents into particular categories (or folders) containing multiple related text documents.

In text classification, a text document may partially match many categories, and we need to find the best matching category for it. The term (word) frequency/inverse document frequency (TF-IDF) approach is commonly used to weight each word in a text document according to how unique it is. In other words, the TF-IDF approach captures the relevancy among words, text documents and particular categories.

We put forward a novel improved TF-IDF approach for text classification, which is the focus of the remainder of this paper; we describe its motivation, methodology and implementation in detail. The paper discusses and analyzes the relationship among confidence, support, recall and precision, and then presents the experimental results.

IMPROVED TF-IDF APPROACH

Text classification can be performed by various classifier learning approaches, such as k-nearest neighbor (Sun et al., 2001), decision tree induction,
naïve Bayes (Fan et al., 2001), support vector machines (Huang and Wu, 1998; Larry and Malik, 2001) and latent semantic indexing (Lin et al., 2000). Some of these techniques are based on, or correlated with, the TF-IDF approach, which represents text in a vector space model (VSM) in which each feature corresponds to a single word. The VSM assumes that a text document d_i is represented by a set of words (t_1, t_2, ..., t_n), wherein each t_i is a word that appears in document d_i, and n denotes the total number of distinct words used to identify the meaning of the document. Word t_i has a corresponding weight w_i calculated as a combination of the statistics TF(w_i, d_i) and IDF(w_i). Therefore, d_i can be represented as a specific n-dimensional vector:

d_i = (w_1, w_2, ..., w_n)    (1)

Weight is the measure that indicates the statistical importance of the corresponding word. The weight w_i of word t_i is determined by the value of TF(w_i, d_i)*IDF(w_i). The TF value is proportional to the frequency of the word in the document, and the IDF value is inversely proportional to the word's frequency in the document corpus. This function encodes two intuitions: (1) the more often a word occurs in a document, the more representative it is of the content of the text; (2) the more texts a word occurs in, the less discriminating it is (Fabrizio, 2002).

Each word vector d_i commonly contains a large number of elements. In order to reduce the vector dimension, we select elements by calculating the value of TF*IDF for each element in the vector representation. The words selected as vector elements are called feature words.

In documents, higher-frequency words are more important for representing the content than lower-frequency words. However, some high-frequency words, such as "the", "for" and "at", have low content-discriminating power and are listed in the stop-list. Chinese stop words were partially discussed by Wang (1992). Words appearing in the stop-list are deleted from the set of feature words in order to reduce the number of dimensions and to enhance the relevancy between words and documents or categories.

Another preprocessing procedure is stemming. In stemming, each word in a document is reduced to its word-stem form. For example, "development", "developed" and "developing" are all treated as "develop". Similarly to the stop-list, stemming reduces the number of dimensions and enhances the relevancy between words and documents or categories.

We observed that an author may use various words to express the same or similar concepts in a text document. Moreover, a particular category consists of a set of documents produced by different authors, each of whom has different personal traits and stylistic features. Therefore, it is common that similar contents in the same category are expressed with different words. To reduce the number of dimensions and further enhance the relevancy between word and document and between word and category, synonyms are defined by an experimental lexicon constructed by us, and all synonyms are processed and counted as the same word.

The calculation of TF, IDF and word weight was discussed by Salton and Buckley (1988) and Salton (1991). Thorsten (1996) analyzed the relationship between text classification using the vector space model with TF-IDF weights and probabilistic classifiers. The analysis offered a theoretical explanation for the TF-IDF word weight and gave insight into the underlying assumptions.
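As a concrete illustration of the preprocessing and weighting pipeline described above, the following minimal Python sketch applies a stop-list, stemming and synonym merging, then computes TF*IDF weights. The paper does not fix the exact TF and IDF formulas, so we assume the common variant TF(w, d) = raw count and IDF(w) = log(N/df(w)); STOP_WORDS, SYNONYMS and stem() are illustrative stand-ins for the stop-list, the experimental synonym lexicon and a real stemmer.

```python
import math
from collections import Counter

# Illustrative stand-ins for the paper's stop-list, stemmer and synonym lexicon.
STOP_WORDS = {"the", "for", "at"}
SYNONYMS = {"car": "automobile"}   # map each synonym to one canonical word

def stem(word):
    # Crude suffix-stripping stand-in for a real stemmer.
    for suffix in ("ment", "ed", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(tokens):
    """Stop-list removal, stemming and synonym merging, as described above."""
    out = []
    for t in tokens:
        t = t.lower()
        if t in STOP_WORDS:
            continue
        s = stem(t)
        out.append(SYNONYMS.get(s, s))
    return out

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: TF*IDF weight} dict per document."""
    docs = [preprocess(d) for d in docs]
    n_docs = len(docs)
    df = Counter()                 # document frequency of each word
    for d in docs:
        df.update(set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Weight = TF(w, d) * IDF(w), with IDF(w) = log(N / df(w)) assumed here.
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors
```

Feature words would then be obtained by keeping only the elements with the highest TF*IDF values.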
These papers treat all words identically for text classification. However, we observe that some words play a dominant role for particular categories; such words are called feature words in this paper. In this case, the likelihood that a document belongs to a particular category is very high when the document contains such feature words. The improved TF-IDF approach uses the feature words to improve the accuracy of text classification.

We introduce two new terms, confidence and support, into the TF-IDF approach. The terms confidence and support were first used in the data mining discipline; however, they take on new and concrete meanings in our approach.

First of all, we need to define several variables. We represent document d_i as a feature word vector d_i, in which each component w_j(d_i) represents the frequency with which the word w_j appears in document d_i. In addition:

N(all, c_i) represents the total number of documents in a particular category c_i.
N(w_j, all) represents the total number of documents containing feature word w_j in the entire training corpus composed of all categories.
N(w_j, c_i) represents the total number of documents containing feature word w_j in the particular category c_i.
N(w_j, \bar{c}_i) represents the total number of documents containing feature word w_j in the entire training corpus except the particular category c_i.
N(all) represents the total number of documents in the entire training corpus.

We define conf(w_j, c_i) as the confidence degree of feature word w_j on the particular category c_i. The value of conf(w_j, c_i) equals the number of documents containing feature word w_j in the particular category c_i divided by the number of documents containing feature word w_j in the entire training corpus composed of all categories:

conf(w_j, c_i) = \frac{N(w_j, c_i)}{N(w_j, all)}    (2)

We define sup(w_j) as the support of feature word w_j. The value of sup(w_j) equals the number of documents containing feature word w_j in the entire training corpus divided by the total number of documents in the entire training corpus:

sup(w_j) = \frac{N(w_j, all)}{N(all)}    (3)

where

0 < conf(w_j, c_i) ≤ 1    (4)
0 < sup(w_j) ≤ 1    (5)

Confidence is the measure of certainty with which a particular feature word determines a particular category. The potential usefulness of a particular feature word is represented by its support. The confidence and support of feature words are the dominant measures for text classification. Each measure can be associated with a threshold that can be adjusted by the user for different types of document corpora. The dominant measure is defined as

dom(w_j, c_i) = f(conf(w_j, c_i), sup(w_j)) =
    1, if (conf(w_j, c_i) ≥ threshold_conf) ∧ (sup(w_j) ≥ threshold_sup)
    0, if (conf(w_j, c_i) < threshold_conf) ∨ (sup(w_j) < threshold_sup)    (6)

Feature words that meet the thresholds (dom(w_j, c_i) = 1) are considered characteristic words, denoted cw(w_j, c_i). When the set of feature words of a document d contains a characteristic word cw(w_j, c_i), the document d is classified into category c_i. In other words, a characteristic word can be used to determine whether or not a document is classified into a particular category. Because a single characteristic word determines the category of a document, this case is briefly called one-word-location.

Similarly, we may define N(w_j ∧ w_k, all), N(w_j ∧ w_k, c_i), conf(w_j ∧ w_k, c_i) and sup(w_j ∧ w_k). For example, we define sup(w_j ∧ w_k) as the support of the feature words w_j and w_k. The value of sup(w_j ∧ w_k) equals the number of documents containing both feature words w_j and w_k in the entire training corpus divided by the total number of documents in the entire training corpus:

sup(w_j ∧ w_k) = \frac{N(w_j ∧ w_k, all)}{N(all)}    (7)

In this situation, we use the feature words w_j and w_k together to determine the category of documents containing both words; here the pair of feature words w_j and w_k is considered as one characteristic word. This case is briefly called two-word-location. We also call a characteristic word a location-word. Finding the location-words is not an arduous task: in order to avoid a combinatorial explosion of feature words, we adopt a heuristic method to choose them, and our experimental experience revealed that it is adequate to choose higher-frequency feature words as candidate location-words. A sketch of this selection procedure is given below.
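The following minimal sketch implements Eqs.(2), (3) and (6): it counts N(w_j, c_i), N(w_j, all) and N(all) over a training corpus and keeps the feature words whose confidence and support clear the thresholds. The count layout and the default threshold values are our own illustrative choices (0.96 echoes the best confidence found in the experiments below; the support threshold is arbitrary).

```python
from collections import defaultdict

def select_characteristic_words(train_docs, conf_threshold=0.96, sup_threshold=0.01):
    """train_docs: list of (set_of_feature_words, category) pairs.

    Returns {(word, category): confidence} for every word with dom(w_j, c_i) = 1,
    following Eqs.(2), (3) and (6). Threshold defaults are illustrative only.
    """
    n_all = len(train_docs)              # N(all)
    n_w_all = defaultdict(int)           # N(w_j, all)
    n_w_c = defaultdict(int)             # N(w_j, c_i)
    for words, cat in train_docs:
        for w in words:
            n_w_all[w] += 1
            n_w_c[(w, cat)] += 1

    characteristic = {}
    for (w, cat), n_wc in n_w_c.items():
        conf = n_wc / n_w_all[w]         # Eq.(2)
        sup = n_w_all[w] / n_all         # Eq.(3)
        if conf >= conf_threshold and sup >= sup_threshold:   # dom(w, c) = 1, Eq.(6)
            characteristic[(w, cat)] = conf
    return characteristic
```

The two-word-location case of Eq.(7) would be handled in the same way, counting co-occurring pairs of higher-frequency feature words instead of single words so that the candidate set stays small.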
DISCUSSION AND ANALYSIS

Precision on a particular category c_i is the percentage given by the number of correctly classified documents in category c_i divided by the total number of documents classified into category c_i:

precision(c_i) = \frac{TP(c_i)}{N(all, c_i)}    (8)

where TP(c_i) represents the number of correctly classified documents in category c_i; TP is the abbreviation for true positive. In other words, TP(c_i) is the number of documents in category c_i that are classified correctly.

Recall on a particular category c_i is the percentage TP(c_i) divided by S(c_i), the total number of documents that should be in category c_i:

recall(c_i) = \frac{TP(c_i)}{S(c_i)}    (9)

where S(c_i) represents the total number of documents that should be in category c_i. The total number of documents that should belong to category c_i but are incorrectly classified into other categories is represented by FN(c_i); FN means false negative. Similarly, the total number of documents that should not belong to category c_i but are incorrectly classified into it is represented by FP(c_i); FP means false positive. FN(c_i, c_j) represents the total number of documents that should belong to category c_i but are incorrectly classified into category c_j. Suppose m is the total number of categories. According to the above definitions, Eqs.(10)~(14) can be inferred:

S(c_i) = TP(c_i) + FN(c_i)    (10)
N(all, c_i) = TP(c_i) + FP(c_i)    (11)
N(all) = \sum_{i=1}^{m} N(all, c_i) = \sum_{i=1}^{m} S(c_i) = \sum_{i=1}^{m} (TP(c_i) + FN(c_i))    (12)
FN(c_i) = \sum_{j=1, j≠i}^{m} FN(c_i, c_j)    (13)
FP(c_i) = \sum_{j=1, j≠i}^{m} FN(c_j, c_i)    (14)

Substituting Eq.(11) into Eq.(8) yields TP(c_i) = precision(c_i)·N(all, c_i) = precision(c_i)·(TP(c_i) + FP(c_i)), so that

FP(c_i) = \frac{1 − precision(c_i)}{precision(c_i)} TP(c_i)    (15)

Substituting Eq.(10) into Eq.(9) yields TP(c_i) = recall(c_i)·TP(c_i) + recall(c_i)·FN(c_i), so that

FN(c_i) = \frac{1 − recall(c_i)}{recall(c_i)} TP(c_i)    (16)

By Eqs.(15) and (16), we obtain

FP(c_i) = \frac{(1 − precision(c_i))·recall(c_i)}{(1 − recall(c_i))·precision(c_i)} FN(c_i)    (17)

In addition, we can compute the accuracy of text classification on the entire document corpus Ω:

accuracy(Ω) = \frac{\sum_{i=1}^{m} TP(c_i)}{|Ω|}    (18)

where |Ω| denotes the number of documents in corpus Ω.

Next, we discuss the precision and recall of text classification on the set of documents that contain a location-word. In this situation, the following equation can be inferred from the definitions in Eqs.(2) and (9):

recall(c_i) = conf(w_j, c_i)    (19)

However, the precision of classification on a particular category depends on the distribution of documents and on the relationship among categories. It is reasonable to suppose that the categories are independent and that the documents are evenly distributed. In this situation, the following relation can be inferred from the definitions in Eqs.(2), (3) and (8):

precision(c_i) ≈ conf(w_j, c_i)    (20)
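For concreteness, here is a small counting sketch of Eqs.(8), (9) and (18): it tallies TP(c_i), FP(c_i) and FN(c_i) from true and predicted labels and derives the per-category precision and recall and the overall accuracy. It is plain bookkeeping, shown only to fix the notation.

```python
from collections import Counter

def evaluate(true_labels, predicted_labels):
    """Per-category precision/recall (Eqs.(8)-(9)) and overall accuracy (Eq.(18))."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1       # TP(c_i): correctly classified into c_i
        else:
            fp[pred] += 1        # FP(c_j): wrongly classified into c_j
            fn[truth] += 1       # FN(c_i): should be c_i, classified elsewhere
    categories = set(true_labels) | set(predicted_labels)
    precision = {c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                 for c in categories}
    recall = {c: tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
              for c in categories}
    accuracy = sum(tp.values()) / len(true_labels)   # Eq.(18)
    return precision, recall, accuracy
```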
For the testing documents corpus, precision(c_i) represents the precision and recall(c_i) represents the recall on category c_i when the conventional TF-IDF approach is adopted. When the improved TF-IDF approach is adopted, the testing corpus can be divided into two sets of documents: one set of documents that contain a location-word, and another set of documents that do not. Applying Eq.(8), we obtain the precision on the particular category c_i for the testing corpus when the improved TF-IDF approach is adopted: the precision equals the total number of documents correctly classified into category c_i in the entire testing corpus divided by the total number of documents classified into category c_i in the entire testing corpus:

precision′ ≈ \frac{\sum_{w_j ∈ L_i} sup(w_j)·N(test)·conf(w_j, c_i) + (1 − \sum_{w_j ∈ L_i} sup(w_j))·TP(c_i)}{\sum_{w_j ∈ L_i} sup(w_j)·N(test) + (1 − \sum_{w_j ∈ L_i} sup(w_j))·(TP(c_i) + FP(c_i))}

= \frac{\sum_{w_j ∈ L_i} sup(w_j)·N(test)·conf(w_j, c_i) + (1 − \sum_{w_j ∈ L_i} sup(w_j))·TP(c_i)}{\sum_{w_j ∈ L_i} sup(w_j)·N(test) + \frac{(1 − \sum_{w_j ∈ L_i} sup(w_j))·TP(c_i)}{precision(c_i)}}

= precision(c_i) + \frac{\sum_{w_j ∈ L_i} sup(w_j)·N(test)·(conf(w_j, c_i) − precision(c_i))}{\sum_{w_j ∈ L_i} sup(w_j)·N(test) + \frac{(1 − \sum_{w_j ∈ L_i} sup(w_j))·TP(c_i)}{precision(c_i)}}    (21)

where N(test) represents the total number of documents in the testing corpus and L_i represents the set of all characteristic words of category c_i. It is common, however, that many a category contains only one particular location-word.

According to Eqs.(3) and (6), sup(w_j)·N(test) is the number of documents in the testing corpus that contain the location-word, and sup(w_j)·N(test)·conf(w_j, c_i) is the number of documents that are correctly classified into category c_i by the location-word. TP(c_i) is the number of documents in the entire corpus that would be correctly classified into category c_i by the conventional TF-IDF approach. The value (1 − sup(w_j))·N(test) is the number of documents in the testing corpus that do not contain the location-word. Therefore, the value (1 − sup(w_j))·TP(c_i) is the number of documents that are correctly classified by the conventional TF-IDF approach among the set of documents that do not contain the location-word. According to Eq.(20), the number of documents classified into category c_i by the location-word is estimated as sup(w_j)·N(test). For documents that do not contain a location-word and are classified into category c_i by the conventional TF-IDF approach, the number is (1 − sup(w_j))·TP(c_i)/precision(c_i), according to Eqs.(8), (11) and (15).
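Operationally, the improved approach behaves as the two-stage classifier analyzed above: a document containing a characteristic word is assigned its category directly, and all other documents fall back to the conventional TF-IDF classifier. A minimal sketch of that decision rule follows; `tfidf_classify` stands for whatever conventional VSM classifier is in use, and breaking ties by highest confidence is our own assumption, since the paper does not specify how overlapping characteristic words are resolved.

```python
def classify(doc_words, characteristic, tfidf_classify):
    """doc_words: set of feature words of one document.
    characteristic: {(word, category): confidence}, as selected via Eq.(6).
    tfidf_classify: fallback conventional TF-IDF classifier (a callable).
    """
    # Stage 1: one-word-location -- a characteristic word decides directly.
    # If several characteristic words match, prefer the highest confidence
    # (an illustrative tie-break, not fixed by the paper).
    hits = [(conf, cat) for (w, cat), conf in characteristic.items()
            if w in doc_words]
    if hits:
        return max(hits)[1]
    # Stage 2: no location-word present -- fall back to conventional TF-IDF.
    return tfidf_classify(doc_words)
```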
Similarly, the recall on the particular category c_i for the testing corpus when the improved TF-IDF approach is adopted is obtained from Eqs.(9), (19) and (20):

recall′ ≈ \frac{\sum_{w_j ∈ L_i} sup(w_j)·N(test)·conf(w_j, c_i) + (1 − \sum_{w_j ∈ L_i} sup(w_j))·TP(c_i)}{\sum_{w_j ∈ L_i} sup(w_j)·N(test) + \frac{(1 − \sum_{w_j ∈ L_i} sup(w_j))·TP(c_i)}{recall(c_i)}}

= recall(c_i) + \frac{\sum_{w_j ∈ L_i} sup(w_j)·N(test)·(conf(w_j, c_i) − recall(c_i))}{\sum_{w_j ∈ L_i} sup(w_j)·N(test) + \frac{(1 − \sum_{w_j ∈ L_i} sup(w_j))·TP(c_i)}{recall(c_i)}}    (22)

The value sup(w_j)·N(test) approximately represents the number of documents in the tested corpus that contain the location-word and should be classified into category c_i. The value (1 − sup(w_j))·TP(c_i)/recall(c_i) represents the number of documents that do not contain the location-word and should be classified into category c_i.

According to Eqs.(21) and (22), the improved TF-IDF approach will improve the precision on category c_i if the value of conf(w_j, c_i) is greater than the precision of the conventional TF-IDF approach. Similarly, the improved TF-IDF approach will improve the recall on category c_i if the value of conf(w_j, c_i) is greater than the recall of the conventional TF-IDF approach.

EXPERIMENTS

We performed two sets of experiments. The first set was designed to evaluate the distribution of location-words and the relationship between confidence and support. The second was designed to assess the validity of the improved TF-IDF approach by comparison with the conventional TF-IDF approach. Both sets of experiments were based on the same text corpus containing 2138 pieces of science and technology literature, partitioned into 7 categories. We randomly sampled two-thirds of the corpus (1425 documents) for training and used the remaining one-third for testing. We repeated the experiment 10 times and averaged the results.

Fig.1 shows the relationship between the specified value of confidence and the ratio of documents containing a location-word in the training corpus. When we raised the value of confidence, the number of documents containing a location-word decreased. This means that the usefulness of location-words for specifying a particular category is lowered as the confidence is increased, while the certainty of classification by the location-word is improved. In other words, when the threshold of confidence is increased: (1) the number of texts containing a location-word decreases, so the number of texts that cannot be processed by the improved TF-IDF approach increases; (2) precision and recall are further increased for the texts to which the improved TF-IDF approach applies, because the confidence is higher. In brief, the threshold of confidence is a trade-off. Our experiments revealed that the best value of confidence was 96%, which, however, varied with the sample space.

[Fig.1 Confidence and location-word: value of confidence (y-axis) versus rate of documents containing a location-word (x-axis)]

Fig.2 shows the precision and recall of all seven categories when the improved TF-IDF approach and the conventional TF-IDF approach were respectively adopted and the value of confidence was 96%.
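The improvement condition just stated can be checked numerically with the single-location-word forms of Eqs.(21) and (22). The sketch below simply evaluates those closed-form estimates; the argument values in the example are invented for illustration, with conf = 0.96 echoing the best confidence found above.

```python
def estimated_precision(conf, sup, n_test, tp, precision):
    """Single-location-word form of Eq.(21)."""
    located = sup * n_test          # documents containing the location-word
    rest = (1 - sup) * tp           # correctly classified documents among the rest
    return (located * conf + rest) / (located + rest / precision)

def estimated_recall(conf, sup, n_test, tp, recall):
    """Single-location-word form of Eq.(22)."""
    located = sup * n_test
    rest = (1 - sup) * tp
    return (located * conf + rest) / (located + rest / recall)

# Example with invented numbers: a confidence of 0.96 above a conventional
# precision of 0.90 lifts the estimate (prints about 0.914 > 0.90).
print(estimated_precision(conf=0.96, sup=0.2, n_test=700, tp=500, precision=0.90))
```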
[Fig.2 Contrasted precision and recall of the conventional and improved TF-IDF approaches: (a) closed test precision; (b) open test precision; (c) closed test recall; (d) open test recall]

CONCLUSIONS AND FURTHER WORK

The improved TF-IDF approach raises the precision and recall of text classification when the value of confidence is chosen properly. Moreover, the improved TF-IDF approach is language-independent. Further experimental work is needed to test the generality of these results. Although science and technology articles can be considered representative of various types of documents, we must see how the findings extend to broader types of documents such as news, web pages and e-mail. Another research issue is how to choose a suitable value of confidence for different corpora.

References
Fabrizio, S., 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47.
Fan, Y., Zheng, C., Wang, Q.Y., Cai, Q.S., Liu, J., 2001. Using naïve Bayes to coordinate the classification of web pages. Journal of Software, 12(9):1386-1392.
Huang, X.J., Wu, L.D., 1998. SVM based classification system. Pattern Recognition and Artificial Intelligence, 11(2):147-153 (in Chinese).
Larry, M.M., Malik, Y., 2001. One-class SVMs for document classification. Journal of Machine Learning Research, 2:139-154.
Lin, H.F., Gao, T., Yao, T.S., 2000. Chinese text visualization. Journal of Northeastern University, 21(5):501-504 (in Chinese).
Olivier, D.V., 2000. Mining E-mail Authorship. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, USA.
Salton, G., 1991. Developments in automatic text retrieval. Science, 253:974-979.
Salton, G., Buckley, C., 1988. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523.
Sun, J., Wang, W., Zhong, Y.X., 2001. Automatic text categorization based on k-nearest neighbor. Journal of Beijing University of Posts & Telecoms, 24(1):42-46 (in Chinese).
Thorsten, J., 1996. Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of the 14th International Conference on Machine Learning, McLean, Virginia, USA, p.78-85.
Wang, Y.C., 1992. The Processing Technology and Basis for Chinese Information. Shanghai Jiaotong University Press, Shanghai, p.10-30 (in Chinese).