Semi-Supervised Text Classification Using Partitioned EM
Gao Cong 1, Wee Sun Lee 1, Haoran Wu 1, Bing Liu 2

1 Department of Computer Science, National University of Singapore, Singapore {conggao, leews, wuhaoran}@comp.nus.edu.sg
2 Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States liub@cs.uic.edu

Abstract. Text classification using a small labeled set and a large unlabeled data set is seen as a promising technique for reducing the labor-intensive and time-consuming effort of labeling training data in order to build accurate classifiers, since unlabeled data is easy to get from the Web. In [16] it has been demonstrated that an unlabeled set improves classification accuracy significantly when only a small labeled training set is available. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. This may not be the case in many applications. In many real-life applications, a class may cover documents from many different topics, which violates the one-to-one correspondence assumption. In such cases, the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to solve the problem. This method first partitions the training documents in a hierarchical fashion using hard clustering. After running the expectation maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes or partitions are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method is able to achieve a dramatic gain in classification performance.

Keywords. Text mining, text classification, semi-supervised learning, labeled and unlabeled data

1 Introduction

With the ever-increasing volume of text data from various online sources, it is an important task to categorize or classify these text documents into categories that are manageable and easy to understand.
Text categorization or classification aims to automatically assign categories or classes to unseen text documents. The task is commonly described as follows: given a set of labeled training documents of n classes, the system uses this training set to build a classifier, which is then used to classify new documents into the n classes. The problem has been studied extensively in information retrieval, machine learning and natural language processing. Past research has produced many text classification techniques, e.g., the naïve Bayesian classifier [16], k-nearest neighbor [18], and support vector machines [10]. These
existing techniques have been used to automatically catalog news articles [13], classify Web pages [5] and learn the reading interests of users [12]. An automatic text classifier can save considerable time and human effort, particularly when aiding human indexers who have already produced a large database of categorized documents. However, the main drawback of these classic techniques is that a large number of labeled training documents are needed in order to build accurate classification systems. The labeling is often done manually, which is very labor intensive and time consuming. Recently, Nigam et al. [16] proposed to use a small set of labeled data and a large set of unlabeled data to help deal with the problem. They showed that the unlabeled set is able to improve the classification accuracy substantially, and that the labeled set can be extremely small. This approach thus saves labor, as the unlabeled set is often easy to obtain from sources such as the Web and UseNet. Nigam et al. [16] proposed a technique that utilizes the EM (Expectation Maximization) algorithm [6] to learn a classifier from both labeled and unlabeled data. The EM algorithm is a class of iterative algorithms for maximum likelihood estimation in problems with missing data. Its objective is to estimate the values of the missing data (or a distribution over the missing values) using the existing values. In the context of learning from labeled and unlabeled sets, the EM algorithm tries to estimate the class labels of the unlabeled data. In each iteration, the EM algorithm performs value estimation by making use of the naïve Bayesian classification method [16]. The method works quite well when the data conforms to its generative assumptions (see the next paragraph). However, since the algorithm in [16] makes use of the naïve Bayesian method, it suffers from the shortcomings of the naïve Bayesian method.
In particular, in devising the Bayesian method for text classification, two assumptions are made: (1) text documents are generated by a mixture model and there is a one-to-one mapping between mixture components and classes; (2) document features are independent given the class. Many researchers have shown that the Bayesian classifier performs surprisingly well despite the obvious violation of (2). However, (1) often causes difficulty when it does not hold. In many real-life situations, the one-to-one correspondence between mixture components and classes does not hold. That is, a class (or category) may cover a number of sub-topics. Several examples are given as follows:

- Given one's bookmarks, one wants to find all the interesting documents from a document collection. We can build a binary classifier for the purpose. However, the set of documents in the bookmarks is unlikely to belong to a single topic because one is often interested in many things, e.g., those related to one's work and those related to one's hobbies. The labeled negative documents are also unlikely to have come from a single topic.
- Junk e-mail filtering. Filtering junk e-mails is a typical binary classification problem (junk and normal e-mails) in which each class contains multiple sub-topics.

In these cases, the classification accuracy of the EM algorithm suffers badly. In [16], attempts are made to relax this restriction. That is, it allows many mixture components for one class (each component represents a particular sub-topic). To determine the number of mixture components for a particular class, it uses cross-validation on the training data. However, experimental results on Reuters data in [16]
show that it is unsatisfactory. It has almost the same performance as the naïve Bayesian method without using unlabeled data. Our experimental results on Reuters data and 20 Newsgroups data (in Section 3) also show that the method in [16] of choosing the number of mixture components through cross-validation hardly improves the performance of naïve Bayesian. In this paper, we propose an alternative method for handling the difficulty in the case of two-class classification problems. We perform hierarchical clustering of the data to obtain a tree-based partition of the input space (training documents). The idea is that if we partition the input space into small enough portions, a simple two-component mixture will be a good enough approximation in each partition. We then utilize the labeled data in order to prune the tree so that for each node of the pruned tree (partition of the input space), there is enough training data for satisfactory performance. The advantage of using hierarchical clustering is that we do not need to know a suitable number of components in advance. Instead, we only need to partition the input space into small enough components such that pruning can be used to effectively find the correctly sized components. Each test document to be classified is first clustered into a leaf node of the pruned hierarchical tree from top to bottom; then we use the classifier built in the leaf node for classification. We also introduce another innovation in the use of early stopping for the EM algorithm. We found that when the two-component mixture model is a poor model for the data, running EM gives much poorer results than simply using the naïve Bayesian classifier (which does not take advantage of the unlabeled data). We use a simple test based on cross-validation to decide when to stop the iterations of the EM algorithm for each partition of the input space. In this way, we gain performance when the model is reasonable for the partition but do not suffer if the model is poor.
So far, extensive experiments have been conducted showing that the proposed method is able to improve the classification dramatically with varied numbers of unlabeled data. The rest of the paper is organized as follows. After reviewing related work, we present our clustering-based method to solve the multiple mixture component problem in Section 2. In Section 3, we empirically evaluate the proposed technique using two document text collections, i.e., the Reuters data and the 20 newsgroups data. Finally, we give the conclusions in Section 4.

1.1 More Related Work

The most closely related work to ours is [16], as discussed above. Another popular technique using unlabeled data is co-training [1, 7, 8, 15, 17], which uses two complementary predictors to iteratively label the unlabeled data. Clearly, these co-training based approaches are different from our proposed technique and are not specially designed to handle classification problems with multiple components in each class. Other methods for using unlabeled data to improve classification include work on transduction using support vector machines [11], the maximum entropy discrimination method [9], and transformation of the input feature space using information from the unlabeled data [19]. Unlabeled data has also been applied in hierarchical text classification [2] and large multi-class text classification [7].
2 Classification Techniques

We first describe the naïve Bayesian classifier and the EM algorithm, followed by our partitioning, pruning and early stopping techniques. We assume throughout this paper that we are dealing with a two-class classification problem.

2.1 Naïve Bayesian Classifier

The naïve Bayesian method is one of the most popular techniques for text classification. It has been shown to perform extremely well in practice by many researchers [14]. Given a set of training documents D, each document is considered an ordered list of words. We use w_{d_i,k} to denote the word in position k of document d_i, where each word is from the vocabulary V = <w_1, w_2, ..., w_{|V|}>. The vocabulary is the set of all words we consider for classification. We also have a set of pre-defined classes, C = {c_1, c_2, ..., c_{|C|}} (in this paper we only consider two-class classification, so C = {c_1, c_2}). In order to perform classification, we need to compute the posterior probability P(c_j|d_i), where c_j is a class and d_i is a document. Based on Bayesian probability and the multinomial model, we have

P(c_j) = \frac{\sum_{i=1}^{|D|} P(c_j|d_i)}{|D|}    (1)

and, with Laplacian smoothing,

P(w_t|c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) P(c_j|d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) P(c_j|d_i)}    (2)

where N(w_t, d_i) is the count of the number of times the word w_t occurs in document d_i, and where P(c_j|d_i) ∈ {0,1} depends on the class label of the document. Finally, assuming that the probabilities of the words are independent given the class, we obtain

P(c_j|d_i) = \frac{P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k}|c_j)}{\sum_{r=1}^{|C|} P(c_r) \prod_{k=1}^{|d_i|} P(w_{d_i,k}|c_r)}    (3)

In the naïve Bayesian classifier, the class with the highest P(c_j|d_i) is assigned as the class of the document.

2.2 The EM Algorithm

This subsection briefly describes the basic EM algorithm and its extension within the probabilistic framework of naïve Bayesian text classification. We refer interested readers to [16] for details.
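The estimates of equations (1)–(3) can be sketched in code. The following is a minimal illustration on toy tokenized documents; the function names (train_nb, classify) and data layout are our own, not from the paper:

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab, classes=(0, 1)):
    """Estimate P(c) and P(w|c) as in equations (1) and (2).
    For fully labeled data, P(c|d) is 0/1, so (1) reduces to a class
    frequency and (2) to Laplace-smoothed word counts per class."""
    prior = {c: sum(1 for y in labels if y == c) / len(docs) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(w for w in d if w in vocab)
    cond = {}
    for c in classes:
        total = sum(counts[c].values())
        # equation (2): (1 + count(w, c)) / (|V| + total words in c)
        cond[c] = {w: (1 + counts[c][w]) / (len(vocab) + total) for w in vocab}
    return prior, cond

def classify(doc, prior, cond, classes=(0, 1)):
    """Equation (3): pick argmax_c P(c) * prod_k P(w_k|c), in log space."""
    scores = {}
    for c in classes:
        scores[c] = math.log(prior[c]) + sum(
            math.log(cond[c][w]) for w in doc if w in cond[c])
    return max(scores, key=scores.get)
```

Working in log space avoids the numerical underflow that the raw product in equation (3) would cause on long documents.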
2.2.1 Basic EM

The Expectation-Maximization (EM) algorithm [6] is a popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. It can be used to fill in missing values in the data, using the existing values to compute the expected value of each missing value. The EM algorithm consists of two steps: the Expectation step and the Maximization step. The Expectation step basically fills in the missing data. The parameters are estimated in the Maximization step after the missing data are filled in or reconstructed. This leads to the next iteration of the algorithm. We use the naïve Bayesian classifier trained on only the labeled examples as the starting point for EM. For the naïve Bayesian classifier, the steps used by EM are identical to those used to build the classifier: equation (3) for the Expectation step, and equations (1) and (2) for the Maximization step. Note that given the document, the probability of the class now takes values in [0,1] for the unlabeled data and in {0,1} for the labeled data. The basic EM approach depends on the assumption that there is a one-to-one correspondence between mixture components and classes. When the assumption does not hold, the unlabeled data often harm the accuracy of the classifier [16]. [16] proposes the multiple mixture components per class technique, in which the above assumption is replaced with a many-to-one correspondence between mixture components and classes.

2.2.2 Multiple Mixture Components per Class (M-EM)

We call EM with multiple mixture components per class M-EM. M-EM assumes that a class may be comprised of several different sub-topics, and thus using multiple mixture components might capture some dependences between words within the classes. Due to space limitations, interested readers are referred to [16] or our full paper for M-EM. One crucial problem of M-EM is how to determine the number of mixture components. [16] used leave-one-out cross-validation (see Section 2.3 for the meaning) for this purpose.
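The basic EM loop described above can be sketched as follows. This is an illustrative simplification (the function name em_nb and the data layout are ours): labeled posteriors stay fixed in {0,1}, unlabeled posteriors are re-estimated in [0,1] each Expectation step, and the Maximization step re-runs equations (1) and (2) with the soft counts:

```python
import math
from collections import defaultdict

def em_nb(labeled, unlabeled, vocab, n_iter=10, classes=(0, 1)):
    """Semi-supervised EM around a multinomial naive Bayes model.
    labeled: list of (tokens, class); unlabeled: list of token lists."""
    # P(c|d): fixed 0/1 rows for labeled docs, uniform start for unlabeled
    post = [[1.0 if y == c else 0.0 for c in classes] for _, y in labeled]
    post += [[1.0 / len(classes)] * len(classes) for _ in unlabeled]
    docs = [d for d, _ in labeled] + list(unlabeled)
    prior, cond = None, None
    for _ in range(n_iter):
        # M-step: equations (1) and (2) with fractional (soft) counts
        prior = [sum(p[c] for p in post) / len(docs) for c in classes]
        wc = [defaultdict(float) for _ in classes]
        for d, p in zip(docs, post):
            for w in d:
                if w in vocab:
                    for c in classes:
                        wc[c][w] += p[c]
        cond = []
        for c in classes:
            total = sum(wc[c].values())
            cond.append({w: (1 + wc[c][w]) / (len(vocab) + total)
                         for w in vocab})
        # E-step: equation (3), recomputed for the unlabeled documents only
        for i in range(len(labeled), len(docs)):
            logs = [math.log(prior[c]) + sum(math.log(cond[c][w])
                    for w in docs[i] if w in vocab) for c in classes]
            m = max(logs)
            exps = [math.exp(v - m) for v in logs]
            z = sum(exps)
            post[i] = [e / z for e in exps]
    return prior, cond, post
```

Subtracting the maximum log score before exponentiating keeps the posterior normalization numerically stable.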
But in practice such a cross-validation approach usually cannot find the optimal number of mixture components. Moreover, M-EM is sensitive to the number of mixture components. As a result, M-EM showed hardly any improvement over NB in the experiments of [16].

2.3 Classification based on Hidden Clusters (CHC)

Our classification technique has three important components: hierarchical clustering (partitioning), pruning and early stopping.

2.3.1 Hierarchical Clustering

Running EM with the naïve Bayesian classifier with two mixture components fails badly when the actual number of components is larger than two. Our main idea is to recursively partition the input space until the number of components in each partition is no more than two. We choose to do each stage of the partitioning using hard clustering with a two-component naïve Bayesian model. The motivation for using hierarchical clustering for partitioning comes from the observation that smaller
components in a mixture can often be clustered into larger components. The large components can then be recursively divided until each partition contains no more than two components, so that EM with a two-component mixture works well. The following example illustrates a situation where EM with a two-component mixture fails but hierarchical clustering works well. Consider a situation where we have four newsgroups: baseball, IBM hardware, hockey and Macintosh hardware. Assume that a user is interested in baseball and IBM hardware (positive class) but not the other classes (negative class). In this case, because both the positive and negative classes contain two components, the two-component mixture model is not suitable for the data. However, suppose we first cluster the training documents into two clusters (partitions); if the clustering method is good enough, most documents about IBM hardware and Macintosh hardware should be clustered into one cluster and most documents about baseball and hockey should be in another cluster. Now within each cluster, both the positive and negative classes mainly contain documents from only one category. Thus in each partition, the assumption of the two-component mixture model is basically satisfied and we can use EM, combining the labeled and unlabeled data of this cluster, to build a text classifier. Fig. 1 shows the idea for the above example.

Original training data:
  Positive: rec.sport.baseball, comp.sys.ibm.pc.hardware
  Negative: rec.sport.hockey, comp.sys.mac.hardware
Partition data into two clusters:
  Cluster 1: rec.sport.baseball, rec.sport.hockey
  Cluster 2: comp.sys.mac.hardware, comp.sys.ibm.pc.hardware
Build a text classifier in each cluster:
  Classifier 1: Positive = rec.sport.baseball; Negative = rec.sport.hockey
  Classifier 2: Positive = comp.sys.ibm.pc.hardware; Negative = comp.sys.mac.hardware

Fig. 1. An example of using the clustering method to help improve the data model for classification
For the newsgroups and classes used in this example, we find that on a training set of 40 labeled examples, 2360 unlabeled examples and a test set of 1600 examples, naïve Bayesian with a two-component mixture achieves an accuracy of 68%, while running EM with the help of the unlabeled examples reduces the accuracy to 59.6%. On the other hand, if we partition the space into two before running EM, we achieve an accuracy of 83.1%. We use hard clustering (also called vector quantization) with the two-component naïve Bayesian model to recursively partition the input space. The labeled data is included in the partitioning algorithm but the labels are not used. This is because the labels can sometimes be unhelpful for partitioning the data into two components, as in the case in Fig. 1, where the addition of the labels will only confuse the algorithm. The
algorithm is shown in Fig. 2. Line 4) uses Equations (1) and (2) and line 5) uses Equation (3). The hard clustering method in each node is guaranteed to converge to a maximum of the log likelihood function in a finite number of steps.

Cluster-Partition (Node A)
1) If node A contains no more than 2 labeled documents, then stop
2) Randomly partition A into two sets L and R
3) Repeat until convergence:
4)   Build a naïve Bayesian model assuming that L and R contain data for two classes.
5)   Reassign each document to L or R based on the classification of the naïve Bayesian model.
6) Cluster-Partition (L)
7) Cluster-Partition (R)

Fig. 2. Algorithm for recursive partitioning based on hard clustering

2.3.2 Pruning

We run the recursive partitioning algorithm until the nodes contain no more than 2 labeled examples and then perform pruning to merge partitions. The pruning step is required for the following two reasons: first, some partitions are too small; second, we notice that when a partition satisfies the two-component mixture model, further partitioning usually deteriorates the accuracy of the classifier. We use the training error on the labeled set as the pruning criterion, under the assumption that if the EM algorithm on the two-component mixture works well, the training error on the labeled set will be small, but if it does not work well, the training error will be large. First, EM with the two-component mixture model is run on every node in the tree (initialized using the labeled data). Pruning is then performed in a bottom-up fashion from the leaves. The algorithm is given in Fig. 3:

Prune-Partition (Node N)
1) Prune-Partition (the left child of N);
2) Prune-Partition (the right child of N);
3) If the number of errors E_P on the labeled set in the node is no more than the combined number of errors E_C on the labeled sets in its two children,
4) then the children are pruned away;
5) else keep the children, and substitute E_P by E_C.

Fig. 3.
Algorithm for pruning partitions

2.3.3 Early Stopping

After pruning, we are left with a partitioned input space where we want to run EM with a two-component mixture model in each partition. For some data sets, even the partitioned input space is still not perfectly suited to the two-component mixture model. When the model is incorrect, we notice that the generalization performance of the algorithm gets worse as the number of iterations of EM increases. In such a case, it is often better to use the naïve Bayesian classifier, or to run only a small number of iterations of EM, instead of running EM to convergence. Moreover, different numbers of iterations of EM may be required for different partitions of the input space.
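The partitioning and pruning procedures of Figs. 2 and 3 can be sketched as follows. This is a simplified illustration with our own function names; in particular, as a stand-in for the per-node EM classifier used in the actual pruning criterion, a node's labeled-set error is approximated here by majority-vote error:

```python
import math
import random
from collections import Counter

def two_way_cluster(docs, vocab, max_iter=20, seed=0):
    """One stage of Fig. 2: hard-cluster documents into two groups with a
    two-component naive Bayes model; class labels are deliberately ignored."""
    rng = random.Random(seed)
    assign = [rng.randint(0, 1) for _ in docs]            # line 2): random split
    for _ in range(max_iter):
        counts = [Counter(), Counter()]                   # line 4): per-cluster model
        for d, a in zip(docs, assign):
            counts[a].update(w for w in d if w in vocab)
        totals = [sum(c.values()) for c in counts]
        def loglik(d, k):
            return sum(math.log((1 + counts[k][w]) / (len(vocab) + totals[k]))
                       for w in d if w in vocab)
        new = [0 if loglik(d, 0) >= loglik(d, 1) else 1 for d in docs]
        if new == assign:                                 # line 3): convergence
            break
        assign = new
    return assign

def partition(items, vocab):
    """Fig. 2, recursion. items: list of (tokens, label) pairs, label may be
    None for unlabeled documents. Stops at <= 2 labeled docs per node."""
    if sum(1 for _, y in items if y is not None) <= 2 or len(items) < 2:
        return {"items": items, "kids": None}
    assign = two_way_cluster([d for d, _ in items], vocab)
    left = [it for it, a in zip(items, assign) if a == 0]
    right = [it for it, a in zip(items, assign) if a == 1]
    if not left or not right:                             # degenerate split: leaf
        return {"items": items, "kids": None}
    return {"items": items,
            "kids": (partition(left, vocab), partition(right, vocab))}

def prune(node):
    """Fig. 3: bottom-up pruning by labeled-set error (majority-vote proxy)."""
    ys = [y for _, y in node["items"] if y is not None]
    node_err = len(ys) - max(Counter(ys).values()) if ys else 0
    if node["kids"] is None:
        node["err"] = node_err
        return node
    l, r = prune(node["kids"][0]), prune(node["kids"][1])
    child_err = l["err"] + r["err"]
    if node_err <= child_err:      # children do not help: merge them away
        node["kids"], node["err"] = None, node_err
    else:
        node["err"] = child_err
    return node
```

The bottom-up order matters: each node compares its own error against the already-pruned error of its children, so a subtree survives only if every level of it pays for itself on the labeled set.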
We use leave-one-out cross-validation to estimate the generalization error at each iteration of EM. For the naïve Bayesian classifier, this can be efficiently calculated by subtracting the word counts of each document from the model before it is tested. For EM, we use the same approximation as used in [16] in order to have an efficient implementation. In the approximation, all the data (labeled and unlabeled) are used in the EM algorithm. In each iteration, before a labeled document is tested, the word counts of the document are subtracted from the model, just like in the naïve Bayesian algorithm. We get:

P'(w_t|c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) P(c_j|d_i) - N(w_t, d_v) P(c_j|d_v)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) P(c_j|d_i) - \sum_{s=1}^{|V|} N(w_s, d_v) P(c_j|d_v)}    (4)

For each labeled document d_v, we calculate P(c_j|d_v) using Equation (3), replacing P(w_t|c_j) with P'(w_t|c_j), and determine its class. We can then compute the classification accuracy of the resulting classifier for each EM iteration on the set of labeled documents. EM will stop when the accuracy decreases.

2.3.4 Using the CHC Classifier

For a test document to be classified, the CHC classifier built from the training data using the methods given above is used in the following two steps: (1) We first cluster the document into one leaf node of the hierarchy obtained in Section 2.3.1, from top to bottom. At each node of the hierarchy (except the leaf nodes), there is a clustering model obtained from the training data which determines whether the test document belongs to its left sub-cluster or its right sub-cluster. In this way, each test document is clustered into exactly one leaf node. The classifier built at that leaf node is used to classify the test document in the next step. (2) We then classify the document using Equation (3) with the parameters obtained in Section 2.3.3.

3 Experiments

3.1 Experimental Setup

The empirical evaluation is done on four groups of data, each group consisting of many datasets.
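The leave-one-out check of Equation (4) in Section 2.3.3 can be sketched as below (an illustrative simplification with our own function name; wc holds the soft word counts accumulated by EM). Since P(c_j|d_v) is 0/1 for a labeled document, the subtraction only affects the document's own class:

```python
import math
from collections import Counter

def loo_accuracy(labeled, wc, totals, prior, vocab, classes=(0, 1)):
    """Leave-one-out accuracy on the labeled set, per Equation (4): before a
    labeled document d_v is tested, its word counts are subtracted from the
    model, then it is classified with the adjusted P'(w|c)."""
    correct = 0
    for d, y in labeled:
        cnt = Counter(w for w in d if w in vocab)
        scores = []
        for c in classes:
            # P(c|d_v) is 1 only for the document's own class
            sub = cnt if c == y else Counter()
            tot = totals[c] - sum(sub.values())
            s = math.log(prior[c])
            for w in d:
                if w in vocab:
                    s += math.log((1 + wc[c][w] - sub[w]) / (len(vocab) + tot))
            scores.append(s)
        correct += classes[scores.index(max(scores))] == y
    return correct / len(labeled)
```

The early-stopping rule then calls this after every EM iteration and stops as soon as the value drops below the previous iteration's accuracy.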
Two data groups are generated from the 20 newsgroups collection, which consists of articles divided almost evenly among 20 different UseNet discussion topics. The other two data groups used in our evaluation are generated from the Reuters collection, which is composed of articles covering 135 potential topic categories. The four data groups are listed as follows (details are given in our full paper due to space limitations):

20A: It is composed of 20 datasets. Each positive set is one of the 20 newsgroups and the negative set is composed of the other 19 newsgroups.
20B: It is composed of 20 datasets. Each of the positive sets consists of randomly selected topics from the 20 newsgroups collection, with the rest of the topics becoming the negative sets.

RA: It is composed of 10 datasets. Each positive set is one of the 10 most populous categories and each negative set consists of the rest of the documents in the Reuters collection.

RB: It is composed of 20 datasets. Each positive set consists of randomly selected categories from the 10 most populous categories of the Reuters collection. Each negative set consists of the rest of the categories.

For all four data groups, we remove stop words, but do not perform stemming. We also remove from the feature set those rare words that appear in fewer than 3 documents in the whole set of documents, for both the 20 newsgroups and Reuters collections. Results on all four data groups are reported as averages over 10 random selections of labeled sets. For data groups 20A and 20B, each test set contains 4000 documents, each unlabeled set contains documents, and labeled sets of different sizes (as shown in Figures 4 & 5) are selected randomly from the remaining documents. For each dataset in data groups RA and RB, we use the ModApte train/test split, leading to a corpus of 9603 training documents and 3299 test documents. In each run on the Reuters collection, 8000 documents are selected randomly from the 9603 training documents as the unlabeled set, and labeled sets of different sizes are selected from the remaining documents. We compare the performance of the following techniques:

- Naïve Bayesian classifier (NB)
- EM run to convergence (EM)
- EM with early stopping (E-EM), but without partitioning the space
- Classification based on Hidden Clusters (CHC), i.e., EM with partitioning using hard clustering and early stopping
- Multiple mixture components EM (M-EM)

For M-EM, [16] performed experiments by assuming one component for the positive class and multiple components for the negative class. The leave-one-out cross-validation method is used to determine the number of mixture components.
We test this technique on datasets 20A and RA, whose positive classes are generated by a single mixture component while the negative classes have multiple sub-topics. The candidate number of mixture components is set from 1 to 20. We set this range since the experimental results in [16] show that both the optimal selection and the selection via cross-validation of the number of components for dataset RA are always less than 20. For dataset 20A, we also set 20 as the upper bound for cross-validation since the dataset contains 20 sub-topics. We think that it is difficult to extend M-EM to the case where both the negative and positive classes consist of multiple mixture components, since [16] does not describe how to do cross-validation to obtain the number of mixture components for that case. Thus, we do not test the technique on the other three data groups in our experiments. We use the F_1 score (see e.g. [3]) to evaluate the effectiveness of our technique on the datasets in 20A, 20B, RA and RB. The F_1 score is computed as follows: F_1 = 2pr/(p + r), where p is the precision and r is the recall (in our case, they are measured on the
positive class, which is the class that we are interested in). Notice that accuracy is not a good performance metric for datasets in 20A, 20B, RA and RB, since high accuracy can often be achieved by always predicting the negative class.

Fig. 4. Average F_1 score of datasets in 20A
Fig. 5. Average F_1 score of datasets in 20B
Fig. 6. Average F_1 score of datasets in RA
Fig. 7. Average F_1 score of datasets in RB

3.2 Experimental Results and Analyses

Figures 4, 5, 6 and 7 show the average F_1 score on datasets in data groups 20A, 20B, RA and RB. The vertical axis shows the average F_1 scores on each data group and the horizontal axis indicates the number of labeled documents. Tables 1 and 2 show the F_1 score for the positive category of each dataset with 200 labeled documents on data groups 20A and 20B respectively. Due to space limitations, we do not give the detailed results for data groups RA and RB. CHC performs significantly better than the other methods, especially when the number of labeled documents is small. For example, on data group 20A, with 200 labeled documents, on average, NB gets an F_1 score of 0.252, EM gets 0.082, E-EM achieves 0.253, M-EM gets 0.302, while CHC achieves 0.442. This represents a 75.4% increase
Table 1. Results of datasets in 20A with 200 labeled documents (F_1 score of each dataset for NB, EM, E-EM, M-EM and CHC, with the average in the last row)

Table 2. Results of datasets in 20B with 200 labeled documents (F_1 score of each dataset for NB, EM, E-EM and CHC, with the average in the last row)

in the F_1 score of CHC against NB, a 439% increase against EM, a 74.7% increase against E-EM and a 46.4% increase against M-EM. Figures 4 and 6 show that M-EM does not consistently improve the performance of NB. In fact, M-EM basically achieved similar results to NB with small numbers of labeled documents. This is consistent with the results reported in [16], where experiments using 50 labeled documents also showed that sometimes M-EM works better than NB while at other times it does not. One interesting finding of our experiments is that E-EM can improve the performance of NB when the number of labeled documents is small. This means that EM may improve the performance of NB in the first few iterations with small labeled sets. The performance can deteriorate with additional iterations, as can be seen from the results of NB, (converged) EM and E-EM in Figures 4 to 7. Experiments on the four data groups show that converged EM does not improve the performance of NB in general. This is because the underlying model of EM does not fit data that covers multiple categories. The reason that CHC achieves better performance than NB, while EM is worse than NB, is that CHC obtains better components (satisfying the simple two-component mixture) by clustering and pruning. We also ran experiments on a group of datasets whose positive set and negative set are each a single category of the 20 newsgroups, i.e., where the two-component model fits the data well. We find that EM is able to perform better than NB in this case, and we also observe that our CHC method can achieve similar accuracy to EM here. Due to space limitations, we do not report the detailed results. In summary, we can conclude that no matter what the characteristics of the datasets are, CHC does much better than, or as well as, the other techniques.
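The F_1 metric used throughout the evaluation above is simple to compute from true-positive, false-positive and false-negative counts on the positive class; a minimal sketch (the function name is illustrative):

```python
def f1_score(tp, fp, fn):
    """F1 on the positive class: the harmonic mean of precision p = tp/(tp+fp)
    and recall r = tp/(tp+fn), i.e. F1 = 2pr/(p + r).
    Defined as 0 when the classifier predicts no true positives."""
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
```

Unlike accuracy, this score stays near zero for the degenerate classifier that always predicts the negative class, which is why it is the right metric for the skewed datasets in 20A, 20B, RA and RB.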
4 Conclusions

We proposed a new method for improving the performance of EM with the naïve Bayesian classifier for the problem of learning with labeled and unlabeled examples, when each class is naturally composed of many clusters. Our method is based on recursive partitioning and pruning, with the aim of partitioning the input space so that within each partition there is a one-to-one correspondence between the classes and the clusters in the partition. Extensive experiments show that the proposed method significantly improves the performance of EM with the naïve Bayesian classifier.

References

1. Blum, A., Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT-98 (1998)
2. Boyapati, V. Improving hierarchical text classification using unlabeled data. Proceedings of SIGIR (2002)
3. Bollmann, P., Cherniavsky, V. Measurement-theoretical investigation of the mz-metric. Information Retrieval Research (1981)
4. Cohen, W. Automatically extracting features for concept learning from the Web. Proceedings of ICML (2000)
5. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI-98 (1998)
6. Dempster, A., Laird, N., Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1) (1977)
7. Ghani, R. Combining labeled and unlabeled data for multiclass text categorization. Proceedings of ICML (2002)
8. Goldman, S., Zhou, Y. Enhanced supervised learning with unlabeled data. Proceedings of ICML (2000)
9. Jaakkola, T., Meila, M., Jebara, T. Maximum entropy discrimination. Advances in Neural Information Processing Systems 12 (2000)
10. Joachims, T. Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98 (1998)
11. Joachims, T. Transductive inference for text classification using support vector machines. Proceedings of ICML-99 (1999)
12. Lang, K. NewsWeeder: Learning to filter netnews. Proceedings of ICML (1995)
13. Lewis, D.,
Gale, W. A sequential algorithm for training text classifiers. Proceedings of SIGIR-94 (1994)
14. McCallum, A., Nigam, K. A comparison of event models for naïve Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization. AAAI Press (1998)
15. Nigam, K., Ghani, R. Analyzing the effectiveness and applicability of co-training. Ninth International Conference on Information and Knowledge Management (2000)
16. Nigam, K., McCallum, A., Thrun, S., Mitchell, T. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3) (2000)
17. Raskutti, B., Ferra, H., Kowalczyk, A. Combining clustering and co-training to enhance text classification using unlabelled data. Proceedings of KDD (2002)
18. Yang, Y. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1, 67-88 (1999)
19. Zelikovitz, S., Hirsh, H. Using LSI for text classification in the presence of background text. Proceedings of CIKM (2001)
More informationAn Interest-Oriented Network Evolution Mechanism for Online Communities
An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne
More informationWeb Spam Detection Using Machine Learning in Specific Domain Features
Journal of Informaton Assurance and Securty 3 (2008) 220-229 Web Spam Detecton Usng Machne Learnng n Specfc Doman Features Hassan Najadat 1, Ismal Hmed 2 Department of Computer Informaton Systems Faculty
More informationTitle Language Model for Information Retrieval
Ttle Language Model for Informaton Retreval Rong Jn Language Technologes Insttute School of Computer Scence Carnege Mellon Unversty Alex G. Hauptmann Computer Scence Department School of Computer Scence
More informationLogistic Regression. Steve Kroon
Logstc Regresson Steve Kroon Course notes sectons: 24.3-24.4 Dsclamer: these notes do not explctly ndcate whether values are vectors or scalars, but expects the reader to dscern ths from the context. Scenaro
More informationBayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending
Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success
More informationDocument Clustering Analysis Based on Hybrid PSO+K-means Algorithm
Document Clusterng Analyss Based on Hybrd PSO+K-means Algorthm Xaohu Cu, Thomas E. Potok Appled Software Engneerng Research Group, Computatonal Scences and Engneerng Dvson, Oak Rdge Natonal Laboratory,
More information1 Example 1: Axis-aligned rectangles
COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 6 Scrbe: Aaron Schld February 21, 2013 Last class, we dscussed an analogue for Occam s Razor for nfnte hypothess spaces that, n conjuncton
More informationHow Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence
1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh
More informationSPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:
SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and
More informationSearching for Interacting Features for Spam Filtering
Searchng for Interactng Features for Spam Flterng Chuanlang Chen 1, Yun-Chao Gong 2, Rongfang Be 1,, and X. Z. Gao 3 1 Department of Computer Scence, Bejng Normal Unversty, Bejng 100875, Chna 2 Software
More informationAn Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services
An Evaluaton of the Extended Logstc, Smple Logstc, and Gompertz Models for Forecastng Short Lfecycle Products and Servces Charles V. Trappey a,1, Hsn-yng Wu b a Professor (Management Scence), Natonal Chao
More informationImproved SVM in Cloud Computing Information Mining
Internatonal Journal of Grd Dstrbuton Computng Vol.8, No.1 (015), pp.33-40 http://dx.do.org/10.1457/jgdc.015.8.1.04 Improved n Cloud Computng Informaton Mnng Lvshuhong (ZhengDe polytechnc college JangSu
More informationCS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements
Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there
More informationProbabilistic Latent Semantic User Segmentation for Behavioral Targeted Advertising*
Probablstc Latent Semantc User Segmentaton for Behavoral Targeted Advertsng* Xaohu Wu 1,2, Jun Yan 2, Nng Lu 2, Shucheng Yan 3, Yng Chen 1, Zheng Chen 2 1 Department of Computer Scence Bejng Insttute of
More informationAbstract. Clustering ensembles have emerged as a powerful method for improving both the
Clusterng Ensembles: {topchyal, Models jan, of punch}@cse.msu.edu Consensus and Weak Parttons * Alexander Topchy, Anl K. Jan, and Wllam Punch Department of Computer Scence and Engneerng, Mchgan State Unversty
More informationRecurrence. 1 Definitions and main statements
Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.
More information1. Measuring association using correlation and regression
How to measure assocaton I: Correlaton. 1. Measurng assocaton usng correlaton and regresson We often would lke to know how one varable, such as a mother's weght, s related to another varable, such as a
More informationSupport Vector Machines
Support Vector Machnes Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan support vector machnes.
More informationbenefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).
REVIEW OF RISK MANAGEMENT CONCEPTS LOSS DISTRIBUTIONS AND INSURANCE Loss and nsurance: When someone s subject to the rsk of ncurrng a fnancal loss, the loss s generally modeled usng a random varable or
More informationAnswer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy
4.02 Quz Solutons Fall 2004 Multple-Choce Questons (30/00 ponts) Please, crcle the correct answer for each of the followng 0 multple-choce questons. For each queston, only one of the answers s correct.
More informationBayesian Cluster Ensembles
Bayesan Cluster Ensembles Hongjun Wang 1, Hanhua Shan 2 and Arndam Banerjee 2 1 Informaton Research Insttute, Southwest Jaotong Unversty, Chengdu, Schuan, 610031, Chna 2 Department of Computer Scence &
More informationHow To Know The Components Of Mean Squared Error Of Herarchcal Estmator S
S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta
More informationSection 5.4 Annuities, Present Value, and Amortization
Secton 5.4 Annutes, Present Value, and Amortzaton Present Value In Secton 5.2, we saw that the present value of A dollars at nterest rate per perod for n perods s the amount that must be deposted today
More informationA Hierarchical Anomaly Network Intrusion Detection System using Neural Network Classification
IDC IDC A Herarchcal Anomaly Network Intruson Detecton System usng Neural Network Classfcaton ZHENG ZHANG, JUN LI, C. N. MANIKOPOULOS, JAY JORGENSON and JOSE UCLES ECE Department, New Jersey Inst. of Tech.,
More informationLuby s Alg. for Maximal Independent Sets using Pairwise Independence
Lecture Notes for Randomzed Algorthms Luby s Alg. for Maxmal Independent Sets usng Parwse Independence Last Updated by Erc Vgoda on February, 006 8. Maxmal Independent Sets For a graph G = (V, E), an ndependent
More informationHallucinating Multiple Occluded CCTV Face Images of Different Resolutions
In Proc. IEEE Internatonal Conference on Advanced Vdeo and Sgnal based Survellance (AVSS 05), September 2005 Hallucnatng Multple Occluded CCTV Face Images of Dfferent Resolutons Ku Ja Shaogang Gong Computer
More informationVision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION
Vson Mouse Saurabh Sarkar a* a Unversty of Cncnnat, Cncnnat, USA ABSTRACT The report dscusses a vson based approach towards trackng of eyes and fngers. The report descrbes the process of locatng the possble
More informationLatent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006
Latent Class Regresson Statstcs for Psychosocal Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson (LCR) What s t and when do we use t? Recall the standard latent class model
More informationUsing Content-Based Filtering for Recommendation 1
Usng Content-Based Flterng for Recommendaton 1 Robn van Meteren 1 and Maarten van Someren 2 1 NetlnQ Group, Gerard Brandtstraat 26-28, 1054 JK, Amsterdam, The Netherlands, robn@netlnq.nl 2 Unversty of
More informationFast Fuzzy Clustering of Web Page Collections
Fast Fuzzy Clusterng of Web Page Collectons Chrstan Borgelt and Andreas Nürnberger Dept. of Knowledge Processng and Language Engneerng Otto-von-Guercke-Unversty of Magdeburg Unverstätsplatz, D-396 Magdeburg,
More informationPerformance Analysis and Coding Strategy of ECOC SVMs
Internatonal Journal of Grd and Dstrbuted Computng Vol.7, No. (04), pp.67-76 http://dx.do.org/0.457/jgdc.04.7..07 Performance Analyss and Codng Strategy of ECOC SVMs Zhgang Yan, and Yuanxuan Yang, School
More informationThe Greedy Method. Introduction. 0/1 Knapsack Problem
The Greedy Method Introducton We have completed data structures. We now are gong to look at algorthm desgn methods. Often we are lookng at optmzaton problems whose performance s exponental. For an optmzaton
More informationCalculation of Sampling Weights
Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample
More informationECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble
1 ECE544NA Fnal Project: Robust Machne Learnng Hardware va Classfer Ensemble Sa Zhang, szhang12@llnos.edu Dept. of Electr. & Comput. Eng., Unv. of Illnos at Urbana-Champagn, Urbana, IL, USA Abstract In
More informationActive Learning for Interactive Visualization
Actve Learnng for Interactve Vsualzaton Tomoharu Iwata Nel Houlsby Zoubn Ghahraman Unversty of Cambrdge Unversty of Cambrdge Unversty of Cambrdge Abstract Many automatc vsualzaton methods have been. However,
More informationFeature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College
Feature selecton for ntruson detecton Slobodan Petrovć NISlab, Gjøvk Unversty College Contents The feature selecton problem Intruson detecton Traffc features relevant for IDS The CFS measure The mrmr measure
More informationA hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm
Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel
More informationHow To Calculate The Accountng Perod Of Nequalty
Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.
More informationAn Enhanced Super-Resolution System with Improved Image Registration, Automatic Image Selection, and Image Enhancement
An Enhanced Super-Resoluton System wth Improved Image Regstraton, Automatc Image Selecton, and Image Enhancement Yu-Chuan Kuo ( ), Chen-Yu Chen ( ), and Chou-Shann Fuh ( ) Department of Computer Scence
More informationPlanning for Marketing Campaigns
Plannng for Marketng Campagns Qang Yang and Hong Cheng Department of Computer Scence Hong Kong Unversty of Scence and Technology Clearwater Bay, Kowloon, Hong Kong, Chna (qyang, csch)@cs.ust.hk Abstract
More informationCan Auto Liability Insurance Purchases Signal Risk Attitude?
Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang
More informationUsing Supervised Clustering Technique to Classify Received Messages in 137 Call Center of Tehran City Council
Usng Supervsed Clusterng Technque to Classfy Receved Messages n 137 Call Center of Tehran Cty Councl Mahdyeh Haghr 1*, Hamd Hassanpour 2 (1) Informaton Technology engneerng/e-commerce, Shraz Unversty (2)
More informationEnterprise Master Patient Index
Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an
More informationTHE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION
Internatonal Journal of Electronc Busness Management, Vol. 3, No. 4, pp. 30-30 (2005) 30 THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Yu-Mn Chang *, Yu-Cheh
More information) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance
Calbraton Method Instances of the Cell class (one nstance for each FMS cell) contan ADC raw data and methods assocated wth each partcular FMS cell. The calbraton method ncludes event selecton (Class Cell
More informationRisk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008
Rsk-based Fatgue Estmate of Deep Water Rsers -- Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn
More informationPREDICTION OF MISSING DATA IN CARDIOTOCOGRAMS USING THE EXPECTATION MAXIMIZATION ALGORITHM
18-19 October 2001, Hotel Kontokal Bay, Corfu PREDICTIO OF MISSIG DATA I CARDIOTOCOGRAMS USIG THE EXPECTATIO MAXIMIZATIO ALGORITHM G. okas Department of Electrcal and Computer Engneerng, Unversty of Patras,
More informationFace Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching)
Face Recognton Problem Face Verfcaton Problem Face Verfcaton (1:1 matchng) Querymage face query Face Recognton (1:N matchng) database Applcaton: Access Control www.vsage.com www.vsoncs.com Bometrc Authentcaton
More informationLogical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME, ISSUE, FEBRUARY ISSN 77-866 Logcal Development Of Vogel s Approxmaton Method (LD- An Approach To Fnd Basc Feasble Soluton Of Transportaton
More informationCluster Analysis. Cluster Analysis
Cluster Analyss Cluster Analyss What s Cluster Analyss? Types of Data n Cluster Analyss A Categorzaton of Maor Clusterng Methos Parttonng Methos Herarchcal Methos Densty-Base Methos Gr-Base Methos Moel-Base
More informationReview of Hierarchical Models for Data Clustering and Visualization
Revew of Herarchcal Models for Data Clusterng and Vsualzaton Lola Vcente & Alfredo Velldo Grup de Soft Computng Seccó d Intel lgènca Artfcal Departament de Llenguatges Sstemes Informàtcs Unverstat Poltècnca
More informationBag-of-Words models. Lecture 9. Slides from: S. Lazebnik, A. Torralba, L. Fei-Fei, D. Lowe, C. Szurka
Bag-of-Words models Lecture 9 Sldes from: S. Lazebnk, A. Torralba, L. Fe-Fe, D. Lowe, C. Szurka Bag-of-features models Overvew: Bag-of-features models Orgns and motvaton Image representaton Dscrmnatve
More informationA Dynamic Load Balancing for Massive Multiplayer Online Game Server
A Dynamc Load Balancng for Massve Multplayer Onlne Game Server Jungyoul Lm, Jaeyong Chung, Jnryong Km and Kwanghyun Shm Dgtal Content Research Dvson Electroncs and Telecommuncatons Research Insttute Daejeon,
More informationAdaptive Fractal Image Coding in the Frequency Domain
PROCEEDINGS OF INTERNATIONAL WORKSHOP ON IMAGE PROCESSING: THEORY, METHODOLOGY, SYSTEMS AND APPLICATIONS 2-22 JUNE,1994 BUDAPEST,HUNGARY Adaptve Fractal Image Codng n the Frequency Doman K AI UWE BARTHEL
More informationPSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12
14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed
More informationCHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol
CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL
More informationGender Classification for Real-Time Audience Analysis System
Gender Classfcaton for Real-Tme Audence Analyss System Vladmr Khryashchev, Lev Shmaglt, Andrey Shemyakov, Anton Lebedev Yaroslavl State Unversty Yaroslavl, Russa vhr@yandex.ru, shmaglt_lev@yahoo.com, andrey.shemakov@gmal.com,
More informationA Probabilistic Theory of Coherence
A Probablstc Theory of Coherence BRANDEN FITELSON. The Coherence Measure C Let E be a set of n propostons E,..., E n. We seek a probablstc measure C(E) of the degree of coherence of E. Intutvely, we want
More information8 Algorithm for Binary Searching in Trees
8 Algorthm for Bnary Searchng n Trees In ths secton we present our algorthm for bnary searchng n trees. A crucal observaton employed by the algorthm s that ths problem can be effcently solved when the
More informationDEFINING %COMPLETE IN MICROSOFT PROJECT
CelersSystems DEFINING %COMPLETE IN MICROSOFT PROJECT PREPARED BY James E Aksel, PMP, PMI-SP, MVP For Addtonal Informaton about Earned Value Management Systems and reportng, please contact: CelersSystems,
More informationL10: Linear discriminants analysis
L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss
More informationINVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS
21 22 September 2007, BULGARIA 119 Proceedngs of the Internatonal Conference on Informaton Technologes (InfoTech-2007) 21 st 22 nd September 2007, Bulgara vol. 2 INVESTIGATION OF VEHICULAR USERS FAIRNESS
More informationHow To Understand The Results Of The German Meris Cloud And Water Vapour Product
Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller
More informationDscusses and Recommendaton Systems
10 Content-based Recommendaton Systems Mchael J. Pazzan 1 and Danel Bllsus 2 1 Rutgers Unversty, ASBIII, 3 Rutgers Plaza New Brunswck, NJ 08901 pazzan@rutgers.edu 2 FX Palo Alto Laboratory, Inc., 3400
More informationCluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms
Internatonal Journal of Appled Informaton Systems (IJAIS) ISSN : 2249-0868 Foundaton of Computer Scence FCS, New York, USA Volume 7 No.7, August 2014 www.jas.org Cluster Analyss of Data Ponts usng Parttonng
More informationIMPACT ANALYSIS OF A CELLULAR PHONE
4 th ASA & μeta Internatonal Conference IMPACT AALYSIS OF A CELLULAR PHOE We Lu, 2 Hongy L Bejng FEAonlne Engneerng Co.,Ltd. Bejng, Chna ABSTRACT Drop test smulaton plays an mportant role n nvestgatng
More informationDesign of Output Codes for Fast Covering Learning using Basic Decomposition Techniques
Journal of Computer Scence (7): 565-57, 6 ISSN 59-66 6 Scence Publcatons Desgn of Output Codes for Fast Coverng Learnng usng Basc Decomposton Technques Aruna Twar and Narendra S. Chaudhar, Faculty of Computer
More informationA Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression
Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,
More informationEnriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger
Enrchng the Knowledge Sources Used n a Maxmum Entropy Part-of-Speech Tagger Krstna Toutanova Dept of Computer Scence Gates Bldg 4A, 353 Serra Mall Stanford, CA 94305 9040, USA krstna@cs.stanford.edu Chrstopher
More informationAn interactive system for structure-based ASCII art creation
An nteractve system for structure-based ASCII art creaton Katsunor Myake Henry Johan Tomoyuk Nshta The Unversty of Tokyo Nanyang Technologcal Unversty Abstract Non-Photorealstc Renderng (NPR), whose am
More informationChapter 6. Classification and Prediction
Chapter 6. Classfcaton and Predcton What s classfcaton? What s Lazy learners (or learnng from predcton? your neghbors) Issues regardng classfcaton and Frequent-pattern-based predcton classfcaton Classfcaton
More informationData Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,
More informationProduct Quality and Safety Incident Information Tracking Based on Web
Product Qualty and Safety Incdent Informaton Trackng Based on Web News 1 Yuexang Yang, 2 Correspondng Author Yyang Wang, 2 Shan Yu, 2 Jng Q, 1 Hual Ca 1 Chna Natonal Insttute of Standardzaton, Beng 100088,
More informationOn the Optimal Control of a Cascade of Hydro-Electric Power Stations
On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;
More informationA DATA MINING APPLICATION IN A STUDENT DATABASE
JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul
More informationNumber of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000
Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from
More informationAN E-MAIL FILTERING AGENT BASED ON SUPPORT VECTOR MACHINES
BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI Publcat de Unverstatea Tehncă Gheorghe Asach dn Iaş Tomul LVI (LX), Fasc. 3, 200 SecŃa AUTOMATICĂ ş CALCULATOARE AN E-MAIL FILTERING AGENT BASED ON SUPPORT VECTOR
More informationMultiple-Period Attribution: Residuals and Compounding
Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens
More informationAn Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems
STAN-CS-73-355 I SU-SE-73-013 An Analyss of Central Processor Schedulng n Multprogrammed Computer Systems (Dgest Edton) by Thomas G. Prce October 1972 Techncal Report No. 57 Reproducton n whole or n part
More information14.74 Lecture 5: Health (2)
14.74 Lecture 5: Health (2) Esther Duflo February 17, 2004 1 Possble Interventons Last tme we dscussed possble nterventons. Let s take one: provdng ron supplements to people, for example. From the data,
More informationRank Based Clustering For Document Retrieval From Biomedical Databases
Jayanth Mancassamy et al /Internatonal Journal on Computer Scence and Engneerng Vol.1(2), 2009, 111-115 Rank Based Clusterng For Document Retreval From Bomedcal Databases Jayanth Mancassamy Department
More informationProject Networks With Mixed-Time Constraints
Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa
More informationNEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION
NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State
More informationLearning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing
Internatonal Journal of Machne Learnng and Computng, Vol. 4, No. 3, June 04 Learnng from Large Dstrbuted Data: A Scalng Down Samplng Scheme for Effcent Data Processng Che Ngufor and Janusz Wojtusak part
More informationCHAPTER 14 MORE ABOUT REGRESSION
CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp
More informationLearning from Multiple Outlooks
Learnng from Multple Outlooks Maayan Harel Department of Electrcal Engneerng, Technon, Hafa, Israel She Mannor Department of Electrcal Engneerng, Technon, Hafa, Israel maayanga@tx.technon.ac.l she@ee.technon.ac.l
More informationDynamic Resource Allocation for MapReduce with Partitioning Skew
Ths artcle has been accepted for publcaton n a future ssue of ths journal, but has not been fully edted. Content may change pror to fnal publcaton. Ctaton nformaton: DOI 1.119/TC.216.253286, IEEE Transactons
More information