Semi-Supervised Text Classification Using Partitioned EM

Gao Cong 1, Wee Sun Lee 1, Haoran Wu 1, Bing Liu 2

1 Department of Computer Science, National University of Singapore, Singapore
{conggao, leews, wuhaoran}@comp.nus.edu.sg
2 Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States
liub@cs.uic.edu

Abstract. Text classification using a small labeled set and a large unlabeled set is seen as a promising technique for reducing the labor-intensive and time-consuming effort of labeling training data in order to build accurate classifiers, since unlabeled data is easy to obtain from the Web. In [16] it has been demonstrated that an unlabeled set improves classification accuracy significantly when only a small labeled training set is available. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. This may not be the case in many applications. In many real-life applications, a class may cover documents from many different topics, which violates the one-to-one correspondence assumption. In such cases, the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to solve the problem. This method first partitions the training documents in a hierarchical fashion using hard clustering. After running the expectation maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes or partitions are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method is able to achieve a dramatic gain in classification performance.

Keywords. Text mining, text classification, semi-supervised learning, labeled and unlabeled data

1 Introduction

With the ever-increasing volume of text data from various online sources, it is an important task to categorize or classify these text documents into categories that are manageable and easy to understand. Text categorization or classification aims to automatically assign categories or classes to unseen text documents. The task is commonly described as follows: given a set of labeled training documents of n classes, the system uses this training set to build a classifier, which is then used to classify new documents into the n classes. The problem has been studied extensively in information retrieval, machine learning and natural language processing. Past research has produced many text classification techniques, e.g., the naïve Bayesian classifier [16], k-nearest neighbor [18], and support vector machines [10].

These existing techniques have been used to automatically catalog news articles [13], classify Web pages [5] and learn the reading interests of users [12]. An automatic text classifier can save considerable time and human effort, particularly when aiding human indexers who have already produced a large database of categorized documents. However, the main drawback of these classic techniques is that a large number of labeled training documents are needed in order to build accurate classification systems. The labeling is often done manually, which is very labor-intensive and time-consuming. Recently, Nigam et al. [16] proposed to use a small set of labeled data and a large set of unlabeled data to help deal with this problem. They showed that the unlabeled set is able to improve the classification accuracy substantially even when the labeled set is extremely small. This approach thus saves labor, as unlabeled data is often easy to obtain from sources such as the Web and UseNet.

Nigam et al. [16] proposed a technique that utilizes the EM (Expectation Maximization) algorithm [6] to learn a classifier from both labeled and unlabeled data. The EM algorithm is a class of iterative algorithms for maximum likelihood estimation in problems with missing data. Its objective is to estimate the values of the missing data (or a distribution over the missing values) using the existing values. In the context of learning from labeled and unlabeled sets, the EM algorithm tries to estimate the class labels of the unlabeled data. In each iteration, the EM algorithm performs this estimation by making use of the naïve Bayesian classification method [16]. The method works quite well when the data conforms to its generative assumptions (see the next paragraph). However, since the algorithm in [16] makes use of the naïve Bayesian method, it suffers from the shortcomings of the naïve Bayesian method. In particular, in devising the Bayesian method for text classification, two assumptions are made: (1) text documents are generated by a mixture model and there is a one-to-one mapping between mixture components and classes; (2) document features are independent given the class. Many researchers have shown that the Bayesian classifier performs surprisingly well despite the obvious violation of (2). However, (1) often causes difficulty when it does not hold.

In many real-life situations, the one-to-one correspondence between mixture components and classes does not hold. That is, a class (or category) may cover a number of sub-topics. Several examples follow:

Given one's bookmarks, one wants to find all the interesting documents from a document collection. We can build a binary classifier for this purpose. However, the documents in the bookmarks are unlikely to belong to a single topic, because one is often interested in many things, e.g., those related to one's work and those related to one's hobbies. The labeled negative documents are also unlikely to have come from a single topic.

Junk e-mail filtering. Filtering junk e-mails is a typical binary classification problem (junk and normal e-mails) in which each class contains multiple sub-topics.

In these cases, the classification accuracy of the EM algorithm suffers badly. In [16], attempts are made to relax this restriction by allowing many mixture components per class (each component representing a particular sub-topic). To determine the number of mixture components for a particular class, it uses cross-validation on the training data.

However, experimental results on Reuters data in [16] show that this is unsatisfactory: it has almost the same performance as the naïve Bayesian method without using unlabeled data. Our experimental results on the Reuters data and the 20 Newsgroups data (in Section 3) also show that the method in [16] of choosing the number of mixture components through cross-validation hardly improves the performance of the naïve Bayesian classifier.

In this paper, we propose an alternative method for handling this difficulty in the case of two-class classification problems. We perform hierarchical clustering of the data to obtain a tree-based partition of the input space (the training documents). The idea is that if we partition the input space into small enough portions, a simple two-component mixture will be a good enough approximation in each partition. We then utilize the labeled data to prune the tree so that for each node of the pruned tree (partition of the input space), there is enough training data for satisfactory performance. The advantage of using hierarchical clustering is that we do not need to know a suitable number of components in advance. Instead, we only need to partition the input space into small enough components such that pruning can be used to effectively find the correctly sized components. Each test document to be classified is first clustered into a leaf node of the pruned hierarchical tree from top to bottom; we then use the classifier built in that leaf node for classification.

We also introduce another innovation in the use of early stopping for the EM algorithm. We found that when the two-component mixture model is a poor model for the data, running EM gives a much poorer result than simply using the naïve Bayesian classifier (which does not take advantage of the unlabeled data). We use a simple test based on cross-validation to decide when to stop the iterations of the EM algorithm for each partition of the input space. In this way, we gain performance when the model is reasonable for the partition but do not suffer if the model is poor. Extensive experiments have been conducted showing that the proposed method is able to improve the classification dramatically over a varied number of unlabeled data.

The rest of the paper is organized as follows. After reviewing related work, we present our clustering-based method for solving the multiple mixture component problem in Section 2. In Section 3, we empirically evaluate the proposed technique using two text document collections, i.e., the Reuters data and the 20 newsgroups data. Finally, we give the conclusions in Section 4.

1.1 More Related Work

The most closely related work to ours is [16], as discussed above. Another popular technique using unlabeled data is co-training [1, 7, 8, 15, 17], which uses two complementary predictors to iteratively label the unlabeled data. Clearly, these co-training based approaches are different from our proposed technique and are not specially designed to handle classification problems with multiple components in each class. Other methods for using unlabeled data to improve classification include work on transduction using support vector machines [11], the maximum entropy discrimination method [9], and transformation of the input feature space using information from the unlabeled data [19]. Unlabeled data has also been applied to hierarchical text classification [2] and large multi-class text classification [7].

2 Classification Techniques

We first describe the naïve Bayesian classifier and the EM algorithm, followed by our partitioning, pruning and early stopping techniques. We assume throughout this paper that we are dealing with a two-class classification problem.

2.1 Naïve Bayesian Classifier

The naïve Bayesian method is one of the most popular techniques for text classification. It has been shown to perform extremely well in practice by many researchers [14]. Given a set of training documents D, each document is considered an ordered list of words. We use w_{d_i,k} to denote the word in position k of document d_i, where each word is from the vocabulary V = <w_1, w_2, ..., w_{|V|}>. The vocabulary is the set of all words we consider for classification. We also have a set of pre-defined classes, C = {c_1, c_2, ..., c_{|C|}} (in this paper we only consider two-class classification, so C = {c_1, c_2}). In order to perform classification, we need to compute the posterior probability P(c_j | d_i), where c_j is a class and d_i is a document. Based on Bayesian probability and the multinomial model, we have

P(c_j) = \frac{\sum_{i=1}^{|D|} P(c_j \mid d_i)}{|D|}    (1)

and, with Laplacian smoothing,

P(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i) P(c_j \mid d_i)}    (2)

where N(w_t, d_i) is the count of the number of times the word w_t occurs in document d_i, and where P(c_j | d_i) ∈ {0, 1} depends on the class label of the document. Finally, assuming that the probabilities of the words are independent given the class, we obtain

P(c_j \mid d_i) = \frac{P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)}{\sum_{r=1}^{|C|} P(c_r) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r)}    (3)

In the naïve Bayesian classifier, the class with the highest P(c_j | d_i) is assigned as the class of the document.

2.2 The EM Algorithm

This subsection briefly describes basic EM and its extension within the probabilistic framework of naïve Bayesian text classification. We refer interested readers to [16] for details.
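To make the estimation steps concrete, the following is a minimal Python sketch (our own, not the authors' code) of the multinomial naïve Bayesian classifier defined by equations (1)-(3); all function and variable names are our own. The same two routines double as the M-step and E-step of the EM algorithm described next.

import math
from collections import Counter

def train_nb(docs, posteriors, vocab):
    """Training / M-step: equations (1) and (2).
    docs: list of word lists; posteriors[i][j] = P(c_j | d_i), which is
    in {0, 1} for labeled documents and in [0, 1] for unlabeled ones."""
    n = len(posteriors[0])
    priors = [sum(p[j] for p in posteriors) / len(docs) for j in range(n)]  # eq (1)
    weighted = [Counter() for _ in range(n)]
    totals = [0.0] * n
    for doc, post in zip(docs, posteriors):
        for w, cnt in Counter(doc).items():
            for j in range(n):
                weighted[j][w] += cnt * post[j]
                totals[j] += cnt * post[j]
    # Laplacian smoothing over the vocabulary, equation (2)
    cond = [{w: (1.0 + weighted[j][w]) / (len(vocab) + totals[j]) for w in vocab}
            for j in range(n)]
    return priors, cond

def classify_nb(doc, priors, cond):
    """Classification / E-step: equation (3), computed in log space.
    Out-of-vocabulary words are skipped; a zero prior (possible only in
    degenerate clusters) is guarded with -inf."""
    log_scores = []
    for j in range(len(priors)):
        s = math.log(priors[j]) if priors[j] > 0 else float("-inf")
        s += sum(math.log(cond[j][w]) for w in doc if w in cond[j])
        log_scores.append(s)
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    return [e / sum(exps) for e in exps]  # posterior P(c_j | d)

With labeled documents encoded as one-hot posteriors and unlabeled ones as soft posteriors, alternating classify_nb (E-step) and train_nb (M-step) over all documents gives the basic EM procedure of Section 2.2.1 below.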

2.2.1 Basic EM

The Expectation-Maximization (EM) algorithm [6] is a popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. It can be used to fill in missing values in the data by computing an expected value for each missing value from the existing values. The EM algorithm consists of two steps, the Expectation step and the Maximization step. The Expectation step basically fills in the missing data. The parameters are estimated in the Maximization step after the missing data has been filled in or reconstructed. This leads to the next iteration of the algorithm.

We use the naïve Bayesian classifier trained using only the labeled examples as the starting point for EM. For the naïve Bayesian classifier, the steps used by EM are identical to those used to build the classifier: equation (3) for the Expectation step, and equations (1) and (2) for the Maximization step. Note that given a document, the probability of the class now takes a value in [0, 1] for the unlabeled data and in {0, 1} for the labeled data.

The basic EM approach depends on the assumption that there is a one-to-one correspondence between mixture components and classes. When the assumption does not hold, the unlabeled data often harms the accuracy of the classifier [16]. [16] proposes the multiple mixture components per class technique, in which the above assumption is replaced with a many-to-one correspondence between mixture components and classes.

2.2.2 Multiple Mixture Components per Class (M-EM)

We call EM with multiple mixture components per class M-EM. M-EM assumes that a class may be comprised of several different sub-topics, and thus using multiple mixture components might capture some dependencies between words within the classes. Due to space limitations, we refer interested readers to [16] or our full paper for the details of M-EM. One crucial problem of M-EM is how to determine the number of mixture components. [16] used leave-one-out cross-validation (see Section 2.3 for the meaning) for this purpose. But in practice such a cross-validation approach usually cannot find the optimal number of mixture components. Moreover, M-EM is sensitive to the number of mixture components. As a result, M-EM showed hardly any improvement over NB in the experiments of [16].

2.3 Classification based on Hidden Clusters (CHC)

Our classification technique has three important components: hierarchical clustering (partitioning), pruning and early stopping.

2.3.1 Hierarchical Clustering

Running EM with the naïve Bayesian classifier with two mixture components fails badly when the actual number of components is larger than two. Our main idea is to recursively partition the input space until the number of components in each partition is no more than two. We choose to do each stage of the partitioning using hard clustering with a two-component naïve Bayesian model.

The motivation for using hierarchical clustering for partitioning comes from the observation that smaller components in a mixture can often be clustered into larger components. The large components can hence be recursively divided until each partition contains no more than two components, so that EM with a two-component mixture works well. The following example illustrates a situation where EM with a two-component mixture fails but hierarchical clustering works well.

Consider a situation where we have four newsgroups: baseball, IBM hardware, hockey and Macintosh hardware. Assume that a user is interested in baseball and IBM hardware (positive class) but not the other classes (negative class). In this case, because both the positive and negative classes contain two components, the two-component mixture model is not suitable for the data. However, suppose we first cluster the training documents into two clusters (partitions); if the clustering method is good enough, most documents about IBM hardware and Macintosh hardware should be clustered into one cluster and most documents about baseball and hockey should be in the other cluster. Now within each cluster, both the positive and negative classes mainly contain documents from only one category. Thus in each partition, the assumption of a two-component mixture model is basically satisfied and we can use EM, combining the labeled and unlabeled data of the cluster, to build a text classifier. Fig. 1 shows the idea for this example.

[Fig. 1. An example of using the clustering method to improve the data model for classification. Original training data: positive = rec.sport.baseball and comp.sys.ibm.pc.hardware; negative = rec.sport.hockey and comp.sys.mac.hardware. The data is partitioned into two clusters: Cluster 1 = rec.sport.baseball and rec.sport.hockey; Cluster 2 = comp.sys.ibm.pc.hardware and comp.sys.mac.hardware. A text classifier is then built in each cluster: Classifier 1 with positive = rec.sport.baseball, negative = rec.sport.hockey; Classifier 2 with positive = comp.sys.ibm.pc.hardware, negative = comp.sys.mac.hardware.]

For the newsgroups and classes used in this example, we find that on a training set of 40 labeled examples, 2360 unlabeled examples and a test set of 1600 examples, naïve Bayesian with a two-component mixture achieves an accuracy of 68%, while running EM with the help of the unlabeled examples reduces the accuracy to 59.6%. On the other hand, if we partition the space into two before running EM, we achieve an accuracy of 83.1%.

We use hard clustering (also called vector quantization) with the two-component naïve Bayesian model to recursively partition the input space. The labeled data is included in the partitioning algorithm, but the labels are not used. This is because the labels can sometimes be unhelpful for partitioning the data into two components, as in the case in Fig. 1, where adding the labels would only confuse the algorithm.

The algorithm is shown in Fig. 2. Line 4) uses equations (1) and (2) and line 5) uses equation (3). The hard clustering method in each node is guaranteed to converge to a maximum of the log likelihood function in a finite number of steps.

Cluster-Partition (Node A)
1) If node A contains no more than 2 labeled documents, then stop
2) Randomly partition A into two sets L and R
3) Repeat until convergence:
4)   Build a naïve Bayesian model assuming that L and R contain data for two classes.
5)   Reassign each document to L or R based on the classification of the naïve Bayesian model.
6) Cluster-Partition (L)
7) Cluster-Partition (R)

Fig. 2. Algorithm for recursive partitioning based on hard clustering

2.3.2 Pruning

We run the recursive partitioning algorithm until the nodes contain no more than 2 labeled examples and then perform pruning to merge partitions. The pruning step is required for two reasons: first, some partitions are too small; second, we notice that when a partition already satisfies the two-component mixture model, further partitioning usually deteriorates the accuracy of the classifier. We use the training error on the labeled set as the pruning criterion, under the assumption that if the EM algorithm on the two-component mixture works well, the training error on the labeled set will be small, but if it does not work well, the training error will be large. First, EM with the two-component mixture model is run on every node in the tree (initialized using the labeled data). Pruning is then performed in a bottom-up fashion from the leaves onward. The algorithm is given in Fig. 3:

Prune-Partition (Node N)
1) Prune-Partition (the left child of N);
2) Prune-Partition (the right child of N);
3) If the number of errors E_P on the labeled set in the node is no more than the combined number of errors E_C on the labeled sets in its two children,
4) then the children are pruned away,
5) else keep the children, and substitute E_P by E_C.

Fig. 3. Algorithm for pruning partitions

2.3.3 Early Stopping

After pruning, we are left with a partitioned input space where we want to run EM with a two-component mixture model in each partition. For some data sets, even the partitioned input space is still not perfectly suitable for the two-component mixture model. When the model is incorrect, we notice that the generalization performance of the algorithm gets worse as the number of iterations of EM increases. In such a case, it is often better to use the naïve Bayesian classifier or to run only a small number of iterations of EM instead of running EM to convergence. Moreover, a different number of iterations of EM may be required for different partitions of the input space.
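Before describing the stopping test, here is a hypothetical end-to-end sketch of the partitioning and pruning procedures of Figs. 2 and 3. It reuses train_nb and classify_nb from the Section 2.1 sketch; Node, run_em (an assumed helper that runs the two-component EM of Section 2.2 inside a partition, initialized from its labeled documents and stopped as in this section, returning (priors, cond)) and all other names are our own.

import random

class Node:
    def __init__(self, docs, labels):
        self.docs, self.labels = docs, labels  # labels[i] is a class index, or None if unlabeled
        self.model = None                      # two-cluster routing model (Fig. 2)
        self.clf = None                        # two-class leaf classifier (EM-trained)
        self.left = self.right = None

def cluster_partition(node, vocab, max_iter=50):
    """Fig. 2: recursive two-way hard clustering; the class labels are not used."""
    if sum(lab is not None for lab in node.labels) <= 2:   # step 1
        return
    assign = [random.randint(0, 1) for _ in node.docs]     # step 2: random split
    for _ in range(max_iter):                              # step 3: repeat until convergence
        hard = [[1.0, 0.0] if a == 0 else [0.0, 1.0] for a in assign]
        priors, cond = train_nb(node.docs, hard, vocab)    # step 4: equations (1) and (2)
        new = []
        for d in node.docs:                                # step 5: equation (3)
            post = classify_nb(d, priors, cond)
            new.append(0 if post[0] >= post[1] else 1)
        if new == assign:
            break
        assign = new
    if len(set(assign)) < 2:                               # degenerate split: keep as a leaf
        return
    node.model = (priors, cond)                            # kept for routing test documents
    for side, attr in ((0, "left"), (1, "right")):
        idx = [i for i, a in enumerate(assign) if a == side]
        child = Node([node.docs[i] for i in idx], [node.labels[i] for i in idx])
        setattr(node, attr, child)
        cluster_partition(child, vocab, max_iter)          # steps 6 and 7

def labeled_errors(node, vocab):
    """Run EM in this partition and count training errors on its labeled documents."""
    node.clf = run_em(node, vocab)                         # assumed helper, see lead-in
    errors = 0
    for doc, lab in zip(node.docs, node.labels):
        if lab is not None:
            post = classify_nb(doc, *node.clf)
            errors += int((0 if post[0] >= post[1] else 1) != lab)
    return errors

def prune_partition(node, vocab):
    """Fig. 3: bottom-up pruning by training error on the labeled set."""
    if node.left is None:                                  # leaf node
        return labeled_errors(node, vocab)
    e_children = prune_partition(node.left, vocab) + prune_partition(node.right, vocab)
    e_parent = labeled_errors(node, vocab)
    if e_parent <= e_children:                             # step 3: prune the children away
        node.left = node.right = None
        return e_parent
    return e_children                                      # else keep children; E_P := E_C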

We use leave-one-out cross-validation to estimate the generalization error at each iteration of EM. For the naïve Bayesian classifier, this can be efficiently calculated by subtracting the word counts of each document from the model before it is tested. For EM, we use the same approximation as used in [16] in order to have an efficient implementation. In the approximation, all the data (labeled and unlabeled) is used in the EM algorithm. In each iteration, before a labeled document d_v is tested, the word counts of the document are subtracted from the model, just as in the naïve Bayesian algorithm. We get:

P'(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i) P(c_j \mid d_i) - N(w_t, d_v) P(c_j \mid d_v)}{|V| + \sum_{s=1}^{|V|} \left( \sum_{i=1}^{|D|} N(w_s, d_i) P(c_j \mid d_i) - N(w_s, d_v) P(c_j \mid d_v) \right)}    (4)

For each labeled document d_v, we calculate P(c_j | d_v) using equation (3), with P(w_t | c_j) replaced by P'(w_t | c_j), and determine its class. We can then compute the classification accuracy of the resulting classifier on the set of labeled documents for each EM iteration. EM stops when this accuracy decreases.

2.3.4 Using the CHC Classifier

For a test document to be classified, the CHC classifier built from the training data using the methods given above is applied in the following two steps:

(1) We first cluster the document, from top to bottom, into one leaf node of the hierarchy obtained in Section 2.3.1. At each node of the hierarchy (except the leaf nodes), there is a clustering model obtained from the training data which determines whether the test document belongs to its left sub-cluster or its right sub-cluster. In this way, each test document is clustered into exactly one leaf node; the classifier built at that leaf node is used to classify the test document in the next step.

(2) We then classify the document using equation (3) with the parameters obtained in Section 2.3.3.
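The two test-time steps can be written directly against the tree produced above. This is our own rendering, reusing Node, classify_nb and the model/clf fields from the earlier sketches:

def classify_chc(root, doc):
    """CHC prediction: (1) route the test document top-down to a leaf using the
    per-node clustering models, then (2) classify with the leaf's EM-trained model."""
    node = root
    while node.left is not None:                 # step (1): top-down routing
        priors, cond = node.model                # hard-clustering model of this internal node
        post = classify_nb(doc, priors, cond)
        node = node.left if post[0] >= post[1] else node.right
    priors, cond = node.clf                      # step (2): equation (3) at the leaf
    post = classify_nb(doc, priors, cond)
    return 0 if post[0] >= post[1] else 1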

3 Experiments

3.1 Experimental Setup

The empirical evaluation is done on four groups of data, each group consisting of many datasets. Two data groups are generated from the 20 newsgroups collection, which consists of articles divided almost evenly among 20 different UseNet discussion topics. The other two data groups used in our evaluation are generated from the Reuters collection, which is composed of articles covering 135 potential topic categories. The four data groups are listed as follows (details are given in our full paper due to space limitations):

20A: composed of 20 datasets. Each positive set is one of the 20 newsgroups and the negative set is composed of the other 19 newsgroups.

20B: composed of 20 datasets. Each positive set consists of randomly selected topics from the 20 newsgroups collection, with the rest of the topics becoming the negative set.

RA: composed of 10 datasets. Each positive set is one of the 10 most populous categories and each negative set consists of the rest of the documents in the Reuters collection.

RB: composed of 20 datasets. Each positive set consists of randomly selected categories from the 10 most populous categories of the Reuters collection. Each negative set consists of the rest of the categories.

For all four data groups, we remove stop words but do not perform stemming. We also remove from the feature set those rare words that appear in fewer than 3 documents in the whole document collection, for both the 20 newsgroups and Reuters collections. Results on all four data groups are reported as averages over 10 random selections of labeled sets. For data groups 20A and 20B, each test set contains 4000 documents; an unlabeled set and labeled sets of different sizes (as shown in Figures 4 and 5) are selected randomly from the remaining documents. For each dataset in data groups RA and RB, we use the ModApte train/test split, leading to a corpus of 9603 training documents and 3299 test documents. In each run on the Reuters collection, 8000 documents are selected randomly from the 9603 training documents as the unlabeled set, and labeled sets of different sizes are selected from the remaining documents.

We compare the performance of the following techniques:
- Naïve Bayesian classifier (NB)
- EM run to convergence (EM)
- EM with early stopping (E-EM), but without partitioning the space
- Classification based on Hidden Clusters (CHC), i.e., EM with partitioning using hard clustering and early stopping
- Multiple mixture components EM (M-EM)

For M-EM, [16] performed experiments by assuming one component for the positive class and multiple components for the negative class; the leave-one-out cross-validation method is used to determine the number of mixture components. We test this technique on data groups 20A and RA, whose positive classes are generated by a single mixture component while the negative classes have multiple sub-topics. The candidate number of mixture components is set from 1 to 20. We set this range since the experimental results in [16] show that, for data group RA, both the optimal number of components and the number selected via cross-validation are always less than 20. For data group 20A, we also set 20 as the upper bound for cross-validation since the dataset contains 20 sub-topics. We think that it is difficult to extend M-EM to the case where both the negative and positive classes consist of multiple mixture components, since [16] does not describe how to do cross-validation to obtain the number of mixture components for that case. Thus, we do not test the technique on the other three datasets in our experiments.

We use the F1 score (see e.g. [3]) to evaluate the effectiveness of our technique on the datasets in 20A, 20B, RA and RB. The F1 score is computed as F1 = 2pr/(p + r), where p is the precision and r is the recall, both measured on the positive class (the class that we are interested in).

Notice that accuracy is not a good performance metric for the datasets in 20A, 20B, RA and RB, since high accuracy can often be achieved by always predicting the negative class.

[Fig. 4. Average F1 score (%) of the datasets in 20A for NB, EM, E-EM, M-EM and CHC, plotted against the number of labeled documents.]

[Fig. 5. Average F1 score (%) of the datasets in 20B for NB, EM, E-EM and CHC, plotted against the number of labeled documents.]

[Fig. 6. Average F1 score (%) of the datasets in RA for NB, EM, E-EM, M-EM and CHC, plotted against the number of labeled documents.]

[Fig. 7. Average F1 score (%) of the datasets in RB for NB, EM, E-EM and CHC, plotted against the number of labeled documents.]

3.2 Experimental Results and Analyses

Figures 4, 5, 6 and 7 show the average F1 scores on the datasets in data groups 20A, 20B, RA and RB. The vertical axis shows the average F1 score on each data group and the horizontal axis indicates the number of labeled documents. Tables 1 and 2 show the F1 score for the positive category of each dataset with 200 labeled documents on data groups 20A and 20B respectively. Due to space limitations, we do not give the detailed results for data groups RA and RB.

CHC performs significantly better than the other methods, especially when the number of labeled documents is small. For example, on data group 20A with 200 labeled documents, on average, NB gets an F1 score of 0.252, EM gets 0.082, E-EM achieves 0.253 and M-EM gets 0.302, while CHC achieves 0.442.

This represents a 75.4% increase in the F1 score of CHC against NB, a 439% increase against EM, a 74.7% increase against E-EM and a 46.4% increase against M-EM.

[Table 1. Results of the datasets in 20A with 200 labeled documents: the F1 score of each set under NB, EM, E-EM, M-EM and CHC, together with the averages.]

[Table 2. Results of the datasets in 20B with 200 labeled documents: the F1 score of each set under NB, EM, E-EM and CHC, together with the averages.]

Figures 4 and 6 show that M-EM does not consistently improve the performance of NB. In fact, M-EM basically achieved results similar to NB with small numbers of labeled documents. This is consistent with the results reported in [16], where experiments using 50 labeled documents also showed that sometimes M-EM works better than NB while at other times it does not.

One interesting finding of our experiments is that E-EM can improve the performance of NB when the number of labeled documents is small. This means that EM may improve the performance of NB in the first few iterations with small labeled sets. The performance can then deteriorate with additional iterations, as can be seen from the results of NB, (converged) EM and E-EM in Figures 4 to 7. Experiments on the first four data groups show that converged EM does not improve the performance of NB in general. This is because the underlying model of EM does not fit data that covers multiple categories. The reason that CHC achieves better performance than NB while EM is worse than NB is that CHC obtains better components (each satisfying the simple two-component mixture) through clustering and pruning.

We also ran experiments on a group of datasets whose positive set and negative set are each a single category of 20 newsgroups, i.e., the two-component model fits the data well. We find that EM is able to perform better than NB in this case, and we also observe that our CHC method achieves accuracy similar to EM. Due to space limitations, we do not report the detailed results here. In summary, we can conclude that no matter what the characteristics of the datasets are, CHC does much better than, or as well as, the other techniques.
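As a footnote to the evaluation, a minimal sketch of the F1 computation used throughout this section (standard definitions; the function name is ours):

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2pr / (p + r), with precision p and recall r measured on the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)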

4 Conclusions

We proposed a new method for improving the performance of EM with the naïve Bayesian classifier for the problem of learning with labeled and unlabeled examples when each class is naturally composed of many clusters. Our method is based on recursive partitioning and pruning, with the aim of partitioning the input space so that within each partition there is a one-to-one correspondence between the classes and the clusters in the partition. Extensive experiments show that the proposed method significantly improves the performance of EM with the naïve Bayesian classifier.

References

1. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of COLT-98 (1998)
2. Boyapati, V.: Improving hierarchical text classification using unlabeled data. In: Proceedings of SIGIR (2002)
3. Bollmann, P., Cherniavsky, V.: Measurement-theoretical investigation of the MZ-metric. In: Information Retrieval Research (1981)
4. Cohen, W.: Automatically extracting features for concept learning from the Web. In: Proceedings of ICML (2000)
5. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of AAAI-98 (1998)
6. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1) (1977)
7. Ghani, R.: Combining labeled and unlabeled data for multiclass text categorization. In: Proceedings of ICML (2002)
8. Goldman, S., Zhou, Y.: Enhanced supervised learning with unlabeled data. In: Proceedings of ICML (2000)
9. Jaakkola, T., Meila, M., Jebara, T.: Maximum entropy discrimination. In: Advances in Neural Information Processing Systems 12 (2000)
10. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98 (1998)
11. Joachims, T.: Transductive inference for text classification using support vector machines. In: Proceedings of ICML-99 (1999)
12. Lang, K.: NewsWeeder: learning to filter netnews. In: Proceedings of ICML (1995)
13. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceedings of SIGIR-94 (1994)
14. McCallum, A., Nigam, K.: A comparison of event models for naïve Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization. AAAI Press (1998)
15. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: Ninth International Conference on Information and Knowledge Management (2000)
16. Nigam, K., McCallum, A., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3) (2000)
17. Raskutti, B., Ferra, H., Kowalczyk, A.: Combining clustering and co-training to enhance text classification using unlabelled data. In: Proceedings of KDD (2002)
18. Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1, 67-88 (1999)
19. Zelikovitz, S., Hirsh, H.: Using LSI for text classification in the presence of background text. In: Proceedings of CIKM (2001)


Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger Enrchng the Knowledge Sources Used n a Maxmum Entropy Part-of-Speech Tagger Krstna Toutanova Dept of Computer Scence Gates Bldg 4A, 353 Serra Mall Stanford, CA 94305 9040, USA krstna@cs.stanford.edu Chrstopher

More information

An interactive system for structure-based ASCII art creation

An interactive system for structure-based ASCII art creation An nteractve system for structure-based ASCII art creaton Katsunor Myake Henry Johan Tomoyuk Nshta The Unversty of Tokyo Nanyang Technologcal Unversty Abstract Non-Photorealstc Renderng (NPR), whose am

More information

Chapter 6. Classification and Prediction

Chapter 6. Classification and Prediction Chapter 6. Classfcaton and Predcton What s classfcaton? What s Lazy learners (or learnng from predcton? your neghbors) Issues regardng classfcaton and Frequent-pattern-based predcton classfcaton Classfcaton

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information

Product Quality and Safety Incident Information Tracking Based on Web

Product Quality and Safety Incident Information Tracking Based on Web Product Qualty and Safety Incdent Informaton Trackng Based on Web News 1 Yuexang Yang, 2 Correspondng Author Yyang Wang, 2 Shan Yu, 2 Jng Q, 1 Hual Ca 1 Chna Natonal Insttute of Standardzaton, Beng 100088,

More information

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

On the Optimal Control of a Cascade of Hydro-Electric Power Stations On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;

More information

A DATA MINING APPLICATION IN A STUDENT DATABASE

A DATA MINING APPLICATION IN A STUDENT DATABASE JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul

More information

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000 Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from

More information

AN E-MAIL FILTERING AGENT BASED ON SUPPORT VECTOR MACHINES

AN E-MAIL FILTERING AGENT BASED ON SUPPORT VECTOR MACHINES BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI Publcat de Unverstatea Tehncă Gheorghe Asach dn Iaş Tomul LVI (LX), Fasc. 3, 200 SecŃa AUTOMATICĂ ş CALCULATOARE AN E-MAIL FILTERING AGENT BASED ON SUPPORT VECTOR

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems STAN-CS-73-355 I SU-SE-73-013 An Analyss of Central Processor Schedulng n Multprogrammed Computer Systems (Dgest Edton) by Thomas G. Prce October 1972 Techncal Report No. 57 Reproducton n whole or n part

More information

14.74 Lecture 5: Health (2)

14.74 Lecture 5: Health (2) 14.74 Lecture 5: Health (2) Esther Duflo February 17, 2004 1 Possble Interventons Last tme we dscussed possble nterventons. Let s take one: provdng ron supplements to people, for example. From the data,

More information

Rank Based Clustering For Document Retrieval From Biomedical Databases

Rank Based Clustering For Document Retrieval From Biomedical Databases Jayanth Mancassamy et al /Internatonal Journal on Computer Scence and Engneerng Vol.1(2), 2009, 111-115 Rank Based Clusterng For Document Retreval From Bomedcal Databases Jayanth Mancassamy Department

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State

More information

Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing

Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing Internatonal Journal of Machne Learnng and Computng, Vol. 4, No. 3, June 04 Learnng from Large Dstrbuted Data: A Scalng Down Samplng Scheme for Effcent Data Processng Che Ngufor and Janusz Wojtusak part

More information

CHAPTER 14 MORE ABOUT REGRESSION

CHAPTER 14 MORE ABOUT REGRESSION CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp

More information

Learning from Multiple Outlooks

Learning from Multiple Outlooks Learnng from Multple Outlooks Maayan Harel Department of Electrcal Engneerng, Technon, Hafa, Israel She Mannor Department of Electrcal Engneerng, Technon, Hafa, Israel maayanga@tx.technon.ac.l she@ee.technon.ac.l

More information

Dynamic Resource Allocation for MapReduce with Partitioning Skew

Dynamic Resource Allocation for MapReduce with Partitioning Skew Ths artcle has been accepted for publcaton n a future ssue of ths journal, but has not been fully edted. Content may change pror to fnal publcaton. Ctaton nformaton: DOI 1.119/TC.216.253286, IEEE Transactons

More information