Clustering Text Documents: An Overview

Radu CREȚULESCU*, Daniel MORARIU*, Lucian VINȚAN*
* "Lucian Blaga" University of Sibiu, Romania, Engineering Faculty, Emil Cioran St. no. 4, Sibiu, e-mail: {radu.kretzulescu, daniel.morariu, lucian.vintan}@ulbsibiu.ro

Abstract: Clustering is an important process in text mining, used for grouping documents based on their contents in order to extract knowledge. In the data mining process, clustering and cluster analysis can be used as a stand-alone application for analyzing the data distribution, in order to observe certain characteristics of each cluster (group). In this paper we present some requirements for clustering algorithms, followed by a possible taxonomy of the commonly used clustering algorithms. In the last section we discuss the external and internal techniques for cluster validation.

Key words: Text Mining, Unsupervised learning, Clustering, Cluster Validation

1. Introduction

In recent years, the significant increase in Web usage and the improvements in the quality and speed of the Internet have transformed our society into one that depends strongly on information. The huge amount of data generated by this communication process represents important information that accumulates daily and is stored in the form of text documents, databases, etc. Retrieving this data is not simple, and therefore data mining techniques were developed for extracting information and knowledge represented in patterns or concepts that are sometimes not obvious.

As mentioned in [5, 9], machine learning software provides the basic techniques for data mining by extracting information from raw data contained in databases. The process usually goes through the following steps: transforming the data into a suitable format; cleaning the data; drawing deductions or conclusions from the extracted data.

Machine learning techniques are divided into two subdomains: supervised learning and unsupervised learning. Within unsupervised learning, one of the main tools is data clustering.
This paper attempts to provide a taxonomy of the most important algorithms used for clustering. For each algorithm category, we selected the most common version of the entire family. The algorithms are presented below in the context of document clustering.

2. Unsupervised versus supervised learning

In supervised learning, the algorithm receives the data (the text documents) together with the class labels of the documents (labeled data). The purpose of supervised learning is to learn the concepts that correctly classify the documents for a given classification algorithm. Based on this learning, the classifier will be able to predict the correct class for unseen examples. Under this paradigm over-fitting may appear: the algorithm memorizes the labels of the training cases instead of learning concepts that generalize. For this reason the outcomes of supervised learning are usually assessed on a test set that is disjoint from the training set. The classification methods used are varied, ranging from traditional statistical approaches and neural networks to kernel-type algorithms [6]. The quality of a classification is measured by its accuracy.

In unsupervised learning the algorithm receives only data without class labels (unlabeled data), and its task is to find an adequate representation of the data distribution. The central point of this paper is to present clustering as a key aspect of unsupervised learning.

Some researchers have combined unsupervised and supervised learning, which gave rise to the concept of semi-supervised learning [4]. In this approach an unlabeled data set is first used to make some assumptions about the data distribution, and then this hypothesis is confirmed or rejected by a supervised approach.

3. Cluster Analysis

Cluster analysis is an iterative process of clustering and cluster validation, facilitated by clustering algorithms and cluster validation methods. It includes two major aspects: clustering and cluster validation. Clustering refers to grouping objects according to certain criteria. To achieve this goal researchers have developed many algorithms [5, 13, 14, 15]. Since no general algorithm can be applied to all types of cases, a validation mechanism is needed so that users can find an algorithm suitable for their particular case. Cluster analysis is a process of discovery through exploration. It can be used to discover structures without the need for interpretation [13]. Cluster validation thus becomes a process of evaluating cluster quality.

3.1 Definitions

Def. 4.1: Clustering is the process of grouping physical or abstract objects into classes of similar properties [14]. A cluster is a collection of objects that are similar to the other objects in the same group and dissimilar to the objects from different groups.

Def. 4.2: Conceptual clustering is a grouping process in which the objects in a group form a class only if they can be described by one concept. This approach differs from conventional clustering, where the dissimilarity measure is based on mathematical distances.

The basic idea of clustering, which is the starting point for both approaches presented above, is to find clusters with high intra-cluster similarity and very small inter-cluster similarity.

The main directions of clustering research focus on distance-based cluster analysis. Several algorithms have been developed, such as partitional algorithms like k-means and k-medoids, or hierarchical algorithms. Further investigations have demonstrated the usefulness of other approaches, such as algorithms based on suffix trees [20, 27], bio-inspired algorithms, such as those based on the behavior of ants searching for food and creating cemeteries [1, 7, 17] or on particle swarms (particle swarm optimization) [21], and algorithms based on ontologies [3, 12].

3.2 General requirements for clustering algorithms

Typical requirements for clustering algorithms in data mining and text documents [9] are:
a. Scalability of the algorithms: Many clustering algorithms work very well with small data sets. However, large databases may contain millions of objects, and clustering only a sample of such a data set may lead to biased results.
b. Ability to use different types of attributes: Many clustering algorithms use numerical data as input. In some cases it is necessary to apply clustering algorithms to string, binary or ordinal data, or to a combination of these.
c. Recognition of clusters of arbitrary shape: Many algorithms find clusters using geometric distances such as the Euclidean or the Manhattan distance. These algorithms, however, tend to find spherical clusters of similar size and density. In reality, clusters can have any shape.
d. Setting input parameters based on minimal domain knowledge: Many clustering algorithms require certain parameters, such as the number of clusters to be determined (e.g. k-means). The clustering result can differ significantly depending on the input parameters, which are difficult to determine for data sets containing high-dimensional objects.
e. Ability to handle noisy data: Many real data sets contain missing data, unknown data or outliers. Some algorithms are sensitive to such data, and their clustering quality becomes poor.
f. Insensitivity to the order of data processing: Some clustering algorithms may produce different results for different orderings of the input (e.g. Single Pass, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)).
g. High dimensionality: Data may contain many dimensions (attributes). Many clustering algorithms produce good results only for low-dimensional data, and the human eye can assess the quality of a cluster only up to three dimensions.
h. Interpretability and usability: Users expect clustering results to be useful, interpretable and understandable.

4. Data used in clustering

4.1 Data representation

A data matrix represents the n objects to be clustered, each described by p attributes, as an n × p matrix:

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
\tag{4.1}
$$

4.2 Dissimilarity matrix

This is an n × n square matrix that contains the dissimilarity measures of all pairs of objects to be clustered. Since $d(i, j) = d(j, i)$ and $d(i, i) = 0$, we obtain the following dissimilarity matrix:

$$
\begin{pmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & d(n,n-1) & 0
\end{pmatrix}
\tag{4.2}
$$

where $d(i, j)$ is the measure of dissimilarity between objects i and j. Objects are clustered according to their similarity or their dissimilarity.
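As an illustration of how the dissimilarity matrix (4.2) can be derived from the data matrix (4.1), here is a minimal Python sketch. The function names `dissimilarity_matrix` and `euclidean` are hypothetical, and the Euclidean distance is used only as a placeholder assumption; any dissimilarity measure with $d(i, i) = 0$ and $d(i, j) = d(j, i)$ could be passed in instead.

```python
import math


def euclidean(x, y):
    """Placeholder dissimilarity (an assumption; any symmetric measure works)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dissimilarity_matrix(data, dist):
    """Build the n x n dissimilarity matrix for a data matrix with n rows."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]      # diagonal stays 0 because d(i, i) = 0
    for i in range(n):
        for j in range(i):                 # compute the lower triangle only
            d[i][j] = d[j][i] = dist(data[i], data[j])  # mirror it: d(i, j) = d(j, i)
    return d


# Toy data matrix: n = 3 objects, p = 2 attributes.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(X, euclidean)
for row in D:
    print(row)
```

Only the lower triangle is computed, which mirrors the triangular layout of (4.2) and halves the number of distance evaluations.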

4.3 Dissimilarity versus similarity

The dissimilarity $d(i, j)$ is a non-negative number that is close to 0 when documents i and j are close to each other and increases as the distance between i and j increases. The dissimilarity between two objects can be obtained by subjective evaluation made by experts based on direct observation, or can be calculated based on correlation coefficients. The similarity $s(i, j)$ is a non-negative number that is close to 0 when the two documents are not similar and increases as the documents become more similar. If the similarity and dissimilarity coefficients are in the range [0, 1], the following equation holds:

$$ d(i, j) = 1 - s(i, j) \tag{4.3} $$

4.4 Common formulas used for dissimilarity

To calculate the dissimilarity of objects we calculate the distance between every pair of objects. The distances must satisfy the following properties:
$d(i, j) \ge 0$ (non-negativity)
$d(i, i) = 0$
$d(i, j) = d(j, i)$ (symmetry)
$d(i, j) \le d(i, h) + d(h, j)$ (triangle inequality)

Commonly used distance formulas, for two objects $x_i$ and $x_j$ with p attributes, are:

a. Euclidean distance:
$$ d_E(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \tag{4.4} $$

b. Manhattan distance:
$$ d_{Ma}(x_i, x_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}| \tag{4.5} $$

c. Minkowski distance:
$$ d_M(x_i, x_j) = \Big( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^q \Big)^{1/q} \tag{4.6} $$

d. Cosine distance:
$$ d_{cos}(x_i, x_j) = 1 - \frac{\sum_{k=1}^{p} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{p} x_{ik}^2} \, \sqrt{\sum_{k=1}^{p} x_{jk}^2}} \tag{4.7} $$

e. Canberra distance:
$$ d_{CAN}(x_i, x_j) = \sum_{k=1}^{p} \frac{|x_{ik} - x_{jk}|}{|x_{ik}| + |x_{jk}|} \tag{4.8} $$

5. A possible taxonomy for clustering algorithms

Data can be arranged into groups using different strategies. Accordingly, clustering techniques can be divided into several categories: techniques based on data partitioning, hierarchical methods [5], methods based on attribute order, density-based methods, grid-based methods and model-based methods [9].

5.1 Partitioning methods

If n objects must be grouped into k groups, a partitioning method constructs k partitions of the objects, with $k \le n$, where each partition represents a cluster. The clusters are formed by optimizing a criterion function. This function expresses the dissimilarity between objects, so that objects grouped into the same cluster are similar and objects from different clusters are dissimilar.
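The criterion function of a partitioning method is built on a pairwise distance, so it may help to see the formulas of Section 4.4 in executable form. The following Python sketch is illustrative only: the function names are hypothetical, the cosine variant assumes non-zero vectors and the form "1 minus cosine similarity", and the Canberra variant skips attributes where both values are zero (a common convention, assumed here).

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, q):
    # q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def cosine_distance(x, y):
    # Assumes neither vector is all zeros (otherwise the norm product is 0).
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def canberra(x, y):
    # Attributes where both values are 0 are skipped (assumed convention).
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if a or b)


# Two toy term-frequency vectors with p = 3 attributes.
x = [1.0, 0.0, 2.0]
y = [1.0, 2.0, 0.0]
for name, value in [
    ("euclidean", euclidean(x, y)),
    ("manhattan", manhattan(x, y)),
    ("minkowski q=3", minkowski(x, y, 3)),
    ("cosine", cosine_distance(x, y)),
    ("canberra", canberra(x, y)),
]:
    print(name, round(value, 4))
```

For text documents the cosine form is often preferred because it ignores document length, whereas the other distances grow with the magnitude of the term counts.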
For this type of grouping method the clusters must satisfy two conditions: each cluster must contain at least one object; each object must be included in exactly one cluster.

The basic idea of these methods is that the algorithm starts with a given number k of partitions (clusters) and then applies an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The moving criterion must respect the condition that objects of the same cluster are similar (close). There are different criteria for evaluating the quality of clusters; these are presented in Section 6.

We can identify three types of partitioning algorithms:
K-Means clustering algorithms: each cluster is represented by the average value of all objects in the cluster.
K-Medoids clustering algorithms: each cluster is represented by the object closest to the center of the cluster (the medoid).
Probabilistic clustering algorithms: a probabilistic method assumes that the data come from a mixture of populations whose distributions and probabilities must be determined.

These algorithms can be used on small to medium data collections in which the resulting clusters have spherical shapes.

5.1.1 K-Means algorithm

The K-Means algorithm [10], [11], [18] is the most popular clustering algorithm used in scientific and industrial applications. Its name comes from representing each of the k clusters $C_j$ by the mean $c_j$, calculated as the centroid of the points grouped in cluster $C_j$. The value of the parameter k is set at the start of the algorithm and represents the number of clusters to be obtained. The similarity between the items of a cluster $C_j$ is calculated in relation to its centroid. The error criterion used is defined by

$$ E(C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 \tag{5.1} $$

where $x_i$ is a given input object and $c_j$ is the centroid of $C_j$.
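The error criterion (5.1) can be evaluated directly for any given partition. The short Python sketch below does so for a toy example; the names `centroid` and `error_criterion` are hypothetical, and each cluster is assumed to be a non-empty list of numeric vectors.

```python
def centroid(points):
    """Mean of the vectors in one cluster (the c_j of the criterion)."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def error_criterion(clusters):
    """Sum of squared Euclidean distances from every object to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for x in points:
            total += sum((a - b) ** 2 for a, b in zip(x, c))
    return total


# Two clusters: the first has centroid (1, 0), the second is a single point.
clusters = [[[0.0, 0.0], [2.0, 0.0]], [[10.0, 10.0]]]
E = error_criterion(clusters)
print(E)
```

A lower value of E means more compact clusters, which is why the iterative relocation described above only accepts moves that keep objects close to their cluster representative.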

The goal is to achieve high intra-cluster similarity and very small inter-cluster similarity, so that the k clusters are as compact and as well separated as possible. Because it computes means, this type of algorithm works only with numerical variables. The algorithm performs the following steps:
1. Generate k random centers in the n-dimensional space; these are the initial centroids.
2. For each object, compute the distance between it and all the centroids, then assign the object to the closest centroid. Several formulas for computing the distance are presented in Section 4.4.
3. Once all objects have been assigned to centroids, recalculate the position of each of the k centroids as the center of all samples assigned to it.
4. Repeat steps 2 and 3 until the centroids no longer change their positions.
5. The objects assigned to each centroid represent the contents of the final clusters.

5.1.2 K-Medoids method

The K-Medoids algorithm is an adaptation of the k-means algorithm. Instead of computing the centroid of each group, the algorithm chooses at each iteration a representative element of each cluster, called the medoid. The medoid of a cluster is the item i in the cluster that minimizes the sum

$$ \sum_{j \in C_i} d(i, j) \tag{5.2} $$

where $C_i$ is the cluster that contains object i and $d(i, j)$ is the distance between objects i and j.

Using existing objects as cluster centers has two advantages. First, a medoid usefully describes its cluster. Second, the distances between objects need not be recalculated at each iteration of the algorithm: they can be computed once and saved in a distance matrix.

The steps of the k-medoids algorithm can be summarized as follows:
1. Choose k objects at random as the initial medoids of the clusters.
2. Assign each object to the cluster of its nearest medoid.
3. Recompute the positions of the k medoids.
4. Repeat steps 2 and 3 until the medoids no longer change.
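The two step lists above share the same assign-and-update loop, which the following Python sketch tries to make concrete. It is a minimal illustration under stated assumptions, not a reference implementation: the function names and the `init` parameter are hypothetical, the Euclidean distance is assumed, and the initial centers are drawn from the data objects themselves (rather than from arbitrary points in space) so that the example is reproducible.

```python
import math
import random


def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def assign(points, centers):
    """Step 2: index of the closest center for every object."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c])) for p in points]


def kmeans(points, k, init=None, seed=0, max_iter=100):
    if init is None:  # step 1 (assumption: seeds are sampled from the data)
        init = random.Random(seed).sample(range(len(points)), k)
    centers = [list(points[i]) for i in init]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):  # step 3: centroid of the objects assigned to c
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # empty cluster keeps its old center
        if new_centers == centers:  # step 4: stop when nothing moves
            break
        centers = new_centers
    return assign(points, centers), centers


def kmedoids(points, k, init=None, seed=0, max_iter=100):
    n = len(points)
    D = [[dist(p, q) for q in points] for p in points]  # distances computed once
    medoids = list(init) if init is not None else random.Random(seed).sample(range(n), k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):  # the medoid minimizes the sum of distances, as in (5.2)
            members = [i for i in range(n) if labels[i] == c]
            if members:
                new_medoids.append(min(members, key=lambda i: sum(D[i][j] for j in members)))
            else:
                new_medoids.append(medoids[c])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
    return labels, medoids


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
km_labels, km_centers = kmeans(points, 2, init=[0, 3])
md_labels, md_medoids = kmedoids(points, 2, init=[0, 3])
print(km_labels, km_centers)
print(md_labels, md_medoids)
```

Fixed initial indices are passed here only to keep the example deterministic; with random initialization both procedures can stop in different local optima, which is one reason the input parameters matter so much for partitioning methods.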
Kaufman and Rousseeuw presented in [16] the PAM (Partitioning Around Medoids) algorithm, an iterative implementation of the k-medoids algorithm. The K-Medoids algorithm is more robust to noise than the k-means algorithm: since it works directly with medoids, it is much less influenced by outliers than the centroid calculation of k-means. However, k-medoids has a higher computational cost: a single iteration costs $O(k(n-k)^2)$. The algorithm is therefore efficient on small data sets but becomes quite slow on large ones.

To apply the PAM algorithm to large data sets, Kaufman and Rousseeuw (1990) [15] developed the CLARA (Clustering Large Applications) algorithm, which applies PAM only to a sample of the data extracted from the given set and finds the k medoids in this reduced set. The complexity of each iteration is $O(kS^2 + k(n-k))$, where S is the sample size, k the number of clusters and n the total number of points. In [15, 22, 25] the value $S = 40 + 2k$ is proposed.

5.2 Hierarchical methods

These methods provide a hierarchical structure of the set of objects. There are two approaches:
Agglomerative: These methods have a "bottom-up" approach. At the beginning each object is a cluster. In the following steps the clusters are merged based on similarity measures, creating a hierarchical structure, until all clusters are joined into a single cluster or until another stopping condition (a given number of clusters, time, etc.) is reached.
Divisive: These methods have a "top-down" approach. At first all objects are considered to belong to a single cluster. In successive iterations each cluster is divided into smaller clusters, until each object is a cluster or until a stopping condition is reached.
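The bottom-up procedure can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `agglomerative` is hypothetical, the stopping condition is a given number of clusters, and the distance between two clusters is taken to be the smallest distance between any two of their members (one of several possible merge criteria).

```python
import math


def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]   # every object starts as a cluster
    merges = []                                    # record of the hierarchy
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Assumed merge criterion: closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]    # a merge is never undone
        del clusters[b]
    return clusters, merges


points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
clusters, merges = agglomerative(points, 2)
print(clusters)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The `merges` list is the hierarchy itself: reading it from first to last entry reproduces the levels of the tree from the leaves towards the root.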
A major problem of hierarchical clustering algorithms is that once a merging or division step has been made it cannot be undone. This rigidity is also a major advantage, because it reduces the computational cost: the algorithm avoids evaluating the various combinations of possible choices.

AGGLOMERATIVE ALGORITHMS

Depending on how the proximity between clusters is computed, several models of hierarchical agglomerative clustering can be distinguished [19]. In the formulas below A and B are the clusters considered for merging, $a \in A$ and $b \in B$ are documents, and d denotes a distance, so the pair of clusters with the smallest value is merged.

Single link: The distance between two clusters is the minimum distance between two documents, one from each cluster, that is, the distance between their two most similar documents.

$$ d(A, B) = \min_{a \in A,\, b \in B} d(a, b) \tag{5.3} $$

Complete link: The distance between two clusters is the distance between their two most dissimilar documents. This is equivalent to choosing for merging the pair of clusters whose union has the smallest diameter.

$$ d(A, B) = \max_{a \in A,\, b \in B} d(a, b) \tag{5.4} $$

Average link: The distance between two clusters is the average of the distances between each element of the first cluster and all elements of the second cluster.

$$ d(A, B) = \frac{1}{|A| \, |B|} \sum_{a \in A,\, b \in B} d(a, b) \tag{5.5} $$

Centroid link: The distance between two clusters is the distance between the cluster centroids.

$$ d(A, B) = \lVert c_A - c_B \rVert \tag{5.6} $$

Ward's method: Ward's method [31] attempts to minimize the increase in the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step:

$$ \Delta(A, B) = \sum_{i \in A \cup B} \lVert x_i - m_{A \cup B} \rVert^2 - \sum_{i \in A} \lVert x_i - m_A \rVert^2 - \sum_{i \in B} \lVert x_i - m_B \rVert^2 = \frac{n_A n_B}{n_A + n_B} \lVert m_A - m_B \rVert^2 \tag{5.7} $$

where $m_j$ is the centroid of cluster j and $n_j$ is the number of objects in cluster j. If two pairs of clusters have centers that are equally far apart, Ward's method merges the pair of smaller clusters.

DIVISIVE ALGORITHMS

The divisive hierarchical clustering approach is conceptually more complex than the agglomerative one. While the first step of an agglomerative approach has $n(n-1)/2$ possibilities for merging two objects, a divisive approach has $2^{n-1} - 1$ possibilities for dividing the data into two groups, which is considerably larger. Kaufman and Rousseeuw presented in [15] the divisive clustering algorithm DIANA (Divisive Analysis).

Algorithms like BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), proposed in [28], and CURE (Clustering Using Representatives), proposed in [8], combine partitioning methods with hierarchical ones.

5.3 Method based on the order of words: Suffix Tree Clustering (STC)

The algorithm presented in [27] does not require a vector representation of the objects (documents), unlike the algorithms presented in the previous sections. The STC algorithm uses the Suffix Tree Document Model (STDM) to represent the documents. It creates clusters based on words or groups of consecutive words shared by documents. Taking the order of words in sentences into account can be a major advantage.

A suffix tree of a document d is a compact tree containing all suffixes s of that document. In our case a suffix is a string consisting of one or more words. Rules for building the suffix tree:
1. Each internal node other than the root must have at least two children, and each edge leaving a node is labeled with a nonempty substring of the document.
2. Any two edges that start from the same node cannot start with the same word.

Suffix tree construction

Let $d = w_1 w_2 w_3 \ldots w_m$ be the representation of a document as a sequence of words $w_i$, $i = 1, \ldots, m$, in the same order as in the document, and let S be a set of n documents. The suffix tree for a set S containing n strings, the j-th having length $m_j$, is a tree that contains exactly one root node and $\sum_{j=1}^{n} m_j$ leaf nodes. Since each internal node has at least two children, it represents the common part of at least two suffixes. Each leaf of the tree can represent one or more documents, and each path from the root to a leaf represents an entire document or a substring of a document. Two documents with many common nodes tend to be similar.

Selecting the base nodes

After the suffix tree is created, some nodes will become clusters. The basic rules for a node to become a cluster are:
a. Each node in the suffix tree that has at least two children is considered a base node and receives a score S(B) according to equation (5.8).
b. The nodes are sorted in descending order of score and the algorithm retains the first k nodes (k being a given threshold). These base nodes are used in the next step.
c. Nodes that receive a score lower than a predetermined threshold are eliminated.

Formulas used: Let $|B|$ be the number of documents contained in cluster B and $|P|$ the number of words in the phrase P that labels it. The score of a cluster is

$$ S(B) = |B| \cdot f(|P|) \tag{5.8} $$

where the weighting function for the length of a phrase is

$$ f(|P|) = \begin{cases} 0.5, & |P| = 1 \\ |P|, & 2 \le |P| \le 6 \\ 6, & |P| > 6 \end{cases} \tag{5.9} $$

This function penalizes phrases that contain a single word, increases linearly for phrases of two to six words and becomes constant for phrases longer than six words.

The similarity between two clusters is calculated as

$$ SIM(B_i, B_j) = \begin{cases} 1, & \dfrac{|B_i \cap B_j|}{|B_i|} > \alpha \ \text{and} \ \dfrac{|B_i \cap B_j|}{|B_j|} > \alpha \\ 0, & \text{otherwise} \end{cases} \tag{5.10} $$

where α is a chosen threshold. In [27] the threshold α is 0.5 and in [26] it is 0.8.
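Formulas (5.8) to (5.10) are simple enough to check by hand, and a short Python sketch may make the scoring and merging rule easier to follow. The function names are hypothetical, each base cluster is assumed to be represented by a non-empty set of document identifiers plus the phrase that labels its node, and phrases are assumed to contain at least one word.

```python
def length_weight(num_words):
    """Weighting function f(|P|) from (5.9); assumes num_words >= 1."""
    if num_words == 1:
        return 0.5                      # single-word phrases are penalized
    return float(min(num_words, 6))     # linear up to six words, then constant


def base_cluster_score(docs, phrase):
    """Score S(B) = |B| * f(|P|) from (5.8)."""
    return len(docs) * length_weight(len(phrase.split()))


def base_cluster_similarity(docs_a, docs_b, alpha=0.5):
    """Binary similarity from (5.10): 1 only if the overlap is large for both."""
    common = len(docs_a & docs_b)
    if common / len(docs_a) > alpha and common / len(docs_b) > alpha:
        return 1
    return 0


# Two base clusters sharing two of their three documents.
b1 = {"d1", "d2", "d3"}
b2 = {"d2", "d3", "d4"}
print(base_cluster_score(b1, "text mining"))   # 3 documents, 2-word phrase
print(base_cluster_similarity(b1, b2))         # overlap 2/3 on both sides
print(base_cluster_similarity(b1, {"d3", "d7", "d8", "d9"}))
```

The default `alpha=0.5` follows the value reported for [27]; passing `alpha=0.8` would correspond to the stricter setting reported for [26].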
Alternatively, the similarity can be calculated with Jaccard's coefficient, which gives the degree of similarity between two nodes. The higher the value, the greater the similarity between the nodes (the value lies in the interval [0, 1], with 1 for identical nodes):

$$ J(B_i, B_j) = \frac{|B_i \cap B_j|}{|B_i \cup B_j|} \tag{5.11} $$

Combining the base nodes

a. The similarity between two base nodes is calculated according to formula (5.10).
b. If the similarity between two nodes exceeds a certain threshold, they are merged and the resulting node becomes a cluster.

5.4 Density-based methods

A topological space can be decomposed into its connected components. From this point of view, a cluster can be defined as a connected component that grows in the direction in which the density is higher. This is why density-based algorithms are able to discover clusters of arbitrary shape. Growing towards higher density also protects the algorithm against noise (outliers). However, these methods have the disadvantage that if a cluster consists of two adjacent areas of different density, both above the given threshold, the result is not very informative [9].

Important algorithms of this type include:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [], which produces clusters with a threshold density.
OPTICS (Ordering Points To Identify the Clustering Structure) [2], which extends the DBSCAN algorithm to all distances ε' smaller than a given threshold ε, with 0 ≤ ε' ≤ ε. The only difference from DBSCAN is that OPTICS does not assign objects to clusters; instead it saves the order in which the objects were processed.

5.5 Grid-based methods

Grid-based methods quantize the object space into a finite number of cells that form a grid structure, and then apply a clustering algorithm to this representation. Their great advantage is that the processing time depends only on the number of cells in each dimension of the quantized space and not on the number of objects contained in the cells. Typical algorithms of this type are STING (Statistical Information Grid) [23], DENCLUE (DENsity-based CLUstEring), CLIQUE and MAFIA (Merging of Adaptive Finite Intervals).

5.6 Model-based methods

Model-based methods assume that there is a model for each cluster and try to find the data that best fit the model. A model-based algorithm can discover clusters by constructing a density function that reflects the spatial distribution of the data. These algorithms can also determine the number of clusters automatically, based on standard statistical methods that take noise into account. Algorithms that fall into this category are AutoClass (Automatic Classification), COBWEB and Class. Algorithms such as SOM neural networks are also used as models in this category. Algorithms that rely on ant behavior and particle swarms may also be included here.

6. Cluster validation

Different clustering algorithms applied to the same data set group the data in different ways, depending on the similarity metric used. This makes analyzing the efficiency of clustering algorithms very difficult. To evaluate the performance or quality of clustering algorithms, objective measures must be established. There are three types of quality measures:
a. external, when there is a priori knowledge about the clusters (pre-labeled data);
b. internal, which use no information about the clusters;
c. relative, which assess the differences between different clustering solutions.
External measures apply to both classification and clustering algorithms, while internal and relative measures apply only to clustering algorithms.

EXTERNAL VALIDATION MEASURES

External validation measures require pre-labeled data sets. Because data can be clustered from many points of view, and the labels may differ between these points of view, each category is compared with the group that contains most of its data. Some of the external evaluation measures are:

Precision is the percentage of retrieved documents that are really relevant to the category. Its value is in the interval [0, 1], with 1 as best.
For a cluster $C_i$ and a known class $S_j$ the precision is:

$$ precision(C_i, S_j) = \frac{|C_i \cap S_j|}{|C_i|} \tag{6.1} $$

Recall is the percentage of documents relevant to the category that are indeed grouped into that category. Its value is in the interval [0, 1], with 1 as best.

$$ recall(C_i, S_j) = \frac{|C_i \cap S_j|}{|S_j|} \tag{6.2} $$

Accuracy is the percentage of documents that are correctly grouped into categories according to the document labels (it needs labeled documents).

F-measure combines precision and recall and is calculated according to formula (6.3):

$$ Fmeasure(C_i, S_j) = \frac{2 \cdot precision(C_i, S_j) \cdot recall(C_i, S_j)}{precision(C_i, S_j) + recall(C_i, S_j)} \tag{6.3} $$

For each class, only the cluster with the highest F-measure is selected. The overall F-measure of a clustering solution is the average of these values weighted by the size of each cluster.

INTERNAL VALIDATION MEASURES

In [31] four such metrics are presented; three of them are described here.

Compactness: This measure expresses how similar the data in a given cluster are:

$$ compactness(C) = \sum_{i=1}^{n_C} (c - x_i)^2 \tag{6.4} $$

where C is the current cluster, $n_C$ the number of elements in the cluster, c the cluster centroid and $x_i$ an element of the cluster. In other words, this metric measures how "close" the documents within a cluster are. The value is in the interval [0, ∞), and lower values are better.

Separability: This measure expresses how dissimilar the clusters are:

$$ separability = \min_{j \ne i} \, dist(c_i, c_j) \tag{6.5} $$

Thus for each cluster the nearest cluster is determined. We seek clusterings that maximize this value, so that the clusters are very dissimilar.

Balance: The balance is computed as the clusters are formed and expresses how well balanced they are:

$$ balance = \frac{n / k}{\max_{i = 1, \ldots, k} n_i} \tag{6.6} $$

where n is the total number of documents, $n_i$ the number of documents in cluster i (the formula takes into consideration the largest cluster formed) and k the number of clusters formed. This measure takes values in the interval [0, 1]. The maximum value 1 is reached when all clusters have the same number of documents. A value close to 0 is obtained when the number of documents per cluster varies greatly.

7. Conclusions

Text document clustering is an unsupervised learning method that groups documents with a certain degree of similarity, using different distance-based metrics. Basically, the clustering algorithm must identify the clusters into which documents are grouped and/or some patterns (rules) that separate one group from another. There is no predefined taxonomy; it is established when the clustering algorithm is run on the set of documents. Cluster evaluation is the next important task. The formulas presented for internal and external evaluation give a measure of the quality of the clustering process. Still, it is very hard to compare different clustering results.

8. Acknowledgment

This work was partially supported by CNCSIS-UEFISCSU, project number PN II-RU code PD_670/

References

[1] Abraham, A., Ramos, V., "Web Usage Mining Using Artificial Ant Colony Clustering and Genetic Programming", Proc. of the Congress on Evolutionary Computation (CEC 2003), Canberra, IEEE Press, 2003
[2] Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J., "OPTICS: Ordering points to identify the clustering structure", In Proc. ACM-SIGMOD Int. Conf. Management of Data, Philadelphia, 1999
[3] Batet, M., Valls, A., Gibert, K., "Improving classical clustering with ontologies", In Proc. IASC 2008, Yokohama, 2008
[4] Bennett, K. P., Demiriz, A., "Semi-supervised support vector machines", In M. S. Kearns, S. A. Solla and D. A. Cohn (Eds.), Advances in Neural Information Processing Systems, Cambridge, MA, MIT Press
[5] Berkhin, P., "A Survey of Clustering Data Mining Techniques", In Kogan, J., Nicholas, C., Teboulle, M. (Eds.), Grouping Multidimensional Data, Springer, 2006
[6] Burges, C. J. C., "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, 2(2):121-167
[7] Dorigo, M., Bonabeau, E., Theraulaz, G., "Ant Algorithms and stigmergy", Future Generation Computer Systems, Vol. 16, 2000
[8] Guha, S., Rastogi, R., Shim, K., "CURE: an efficient clustering algorithm for large databases", In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD '98), Ashutosh Tiwary and Michael Franklin (Eds.), ACM, New York, NY, USA, 1998
[9] Han, J., Kamber, M., "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001
[10] Hartigan, J. A., Wong, M., "Algorithm AS 136: A k-means clustering algorithm", Journal of the Royal Statistical Society, Series C (Applied Statistics), London, 1979
[11] Hartigan, J. A., "Clustering Algorithms", New York: John Wiley & Sons, Inc., 1975
[12] Hotho, A., Staab, S., Stumme, G., "Ontologies Improve Text Document Clustering", Third IEEE International Conference on Data Mining (ICDM'03), p. 541, 2003
[13] Jain, A. K., Dubes, R. C., "Algorithms for Clustering Data", Prentice Hall, Englewood Cliffs, NJ
[14] Jain, A., Murty, M. N., Flynn, P. J., "Data Clustering: A Review", ACM Computing Surveys, Vol. 31(3), 1999
[15] Kaufman, L., Rousseeuw, P. J., "Finding Groups in Data: An Introduction to Cluster Analysis", Wiley-Interscience, New York (Series in Applied Probability and Statistics), 1990
[16] Kaufman, L., Rousseeuw, P. J., "Clustering by means of medoids", In Statistical Data Analysis Based on the L1 Norm, edited by Y. Dodge, Elsevier/North-Holland, Amsterdam, 1987
[17] Labroche, N., Monmarché, N., Venturini, G., "AntClust: Ant Clustering and Web Usage Mining", In Genetic and Evolutionary Computation, GECCO 2003, Lecture Notes in Computer Science, Volume 2723/2003, 2003
[18] MacQueen, J. B., "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967
[19] Manning, C., "An Introduction to Information Retrieval", Cambridge University Press, 2009
[20] Meyer, S., Stein, B., Potthast, M., "The Suffix Tree Document Model Revisited", Proceedings of I-KNOW 05, 5th International Conference on Knowledge Management, Journal of Universal Computer Science, Graz, 2005
[21] van der Merwe, D. W., Engelbrecht, A. P., "Data clustering using particle swarm optimization", In The 2003 Congress on Evolutionary Computation (CEC '03), Canberra, 2003
[22] Ng, R., Han, J., "Efficient and Effective Clustering Methods for Spatial Data Mining", Proceedings of the International Conference on Very Large Data Bases, Santiago, Chile, Sept.
[23] Wang, W., Yang, J., Muntz, R., "STING: A statistical information grid approach to spatial data mining", In Proc. Int. Conf. Very Large Data Bases, Athens, 1997
[24] Ward, J. H., "Hierarchical grouping to optimize an objective function", J. Am. Statist. Assoc. 58, 236-244, 1963
[25] Wei, C. P., Lee, Y. H., Hsu, C. M., "Empirical Comparison of Fast Clustering Algorithms for Large Data Sets", Expert Systems with Applications, vol. 24
[26] Wen, H., "Web Snippets Clustering Based on an Improved Suffix Tree Algorithm", Proceedings of FSKD 2009, Sixth International Conference on Fuzzy Systems and Knowledge Discovery, Tianjin, China, August 2009
[27] Zamir, O., Etzioni, O., "Web Document Clustering: A Feasibility Demonstration", Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998
[28] Zhang, T., Ramakrishnan, R., Livny, M., "BIRCH: an efficient data clustering method for very large databases", In Proceedings of the 1996 ACM-SIGMOD Int. Conf. Management of Data, Montreal, 1996
[29] Zhang, B., "Generalized k-Harmonic Means: Dynamic Weighting of Data in Unsupervised Learning", In Proceedings of the 1st SIAM ICDM, Chicago, 2001
[30] Zhao, Y., Karypis, G., "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering", Machine Learning, Vol. 55, No. 3, 2004
[31] "Metrics for evaluating clustering algorithms", http://, accessed

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis The Development of Web Log Mnng Based on Improve-K-Means Clusterng Analyss TngZhong Wang * College of Informaton Technology, Luoyang Normal Unversty, Luoyang, 471022, Chna wangtngzhong2@sna.cn Abstract.

More information

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications CMSC828G Prncples of Data Mnng Lecture #9 Today s Readng: HMS, chapter 9 Today s Lecture: Descrptve Modelng Clusterng Algorthms Descrptve Models model presents the man features of the data, a global summary

More information

Cluster Analysis. Cluster Analysis

Cluster Analysis. Cluster Analysis Cluster Analyss Cluster Analyss What s Cluster Analyss? Types of Data n Cluster Analyss A Categorzaton of Maor Clusterng Methos Parttonng Methos Herarchcal Methos Densty-Base Methos Gr-Base Methos Moel-Base

More information

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College Feature selecton for ntruson detecton Slobodan Petrovć NISlab, Gjøvk Unversty College Contents The feature selecton problem Intruson detecton Traffc features relevant for IDS The CFS measure The mrmr measure

More information

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm Document Clusterng Analyss Based on Hybrd PSO+K-means Algorthm Xaohu Cu, Thomas E. Potok Appled Software Engneerng Research Group, Computatonal Scences and Engneerng Dvson, Oak Rdge Natonal Laboratory,

More information

A DATA MINING APPLICATION IN A STUDENT DATABASE

A DATA MINING APPLICATION IN A STUDENT DATABASE JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul

More information

320 The Internatonal Arab Journal of Informaton Technology, Vol. 5, No. 3, July 2008 Comparsons Between Data Clusterng Algorthms Osama Abu Abbas Computer Scence Department, Yarmouk Unversty, Jordan Abstract:

More information

An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne

More information

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module LOSSLESS IMAGE COMPRESSION SYSTEMS Lesson 3 Lossless Compresson: Huffman Codng Instructonal Objectves At the end of ths lesson, the students should be able to:. Defne and measure source entropy..

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

The Greedy Method. Introduction. 0/1 Knapsack Problem

The Greedy Method. Introduction. 0/1 Knapsack Problem The Greedy Method Introducton We have completed data structures. We now are gong to look at algorthm desgn methods. Often we are lookng at optmzaton problems whose performance s exponental. For an optmzaton

More information

What is Candidate Sampling

What is Candidate Sampling What s Canddate Samplng Say we have a multclass or mult label problem where each tranng example ( x, T ) conssts of a context x a small (mult)set of target classes T out of a large unverse L of possble

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by 6 CHAPTER 8 COMPLEX VECTOR SPACES 5. Fnd the kernel of the lnear transformaton gven n Exercse 5. In Exercses 55 and 56, fnd the mage of v, for the ndcated composton, where and are gven by the followng

More information

Recurrence. 1 Definitions and main statements

Recurrence. 1 Definitions and main statements Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.

More information

Conversion between the vector and raster data structures using Fuzzy Geographical Entities

Conversion between the vector and raster data structures using Fuzzy Geographical Entities Converson between the vector and raster data structures usng Fuzzy Geographcal Enttes Cdála Fonte Department of Mathematcs Faculty of Scences and Technology Unversty of Combra, Apartado 38, 3 454 Combra,

More information

Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network

Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network 700 Proceedngs of the 8th Internatonal Conference on Innovaton & Management Forecastng the Demand of Emergency Supples: Based on the CBR Theory and BP Neural Network Fu Deqang, Lu Yun, L Changbng School

More information

Ants Can Schedule Software Projects

Ants Can Schedule Software Projects Ants Can Schedule Software Proects Broderck Crawford 1,2, Rcardo Soto 1,3, Frankln Johnson 4, and Erc Monfroy 5 1 Pontfca Unversdad Católca de Valparaíso, Chle FrstName.Name@ucv.cl 2 Unversdad Fns Terrae,

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

Forecasting the Direction and Strength of Stock Market Movement

Forecasting the Direction and Strength of Stock Market Movement Forecastng the Drecton and Strength of Stock Market Movement Jngwe Chen Mng Chen Nan Ye cjngwe@stanford.edu mchen5@stanford.edu nanye@stanford.edu Abstract - Stock market s one of the most complcated systems

More information

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there

More information

An Alternative Way to Measure Private Equity Performance

An Alternative Way to Measure Private Equity Performance An Alternatve Way to Measure Prvate Equty Performance Peter Todd Parlux Investment Technology LLC Summary Internal Rate of Return (IRR) s probably the most common way to measure the performance of prvate

More information

Implementation of Deutsch's Algorithm Using Mathcad

Implementation of Deutsch's Algorithm Using Mathcad Implementaton of Deutsch's Algorthm Usng Mathcad Frank Roux The followng s a Mathcad mplementaton of Davd Deutsch's quantum computer prototype as presented on pages - n "Machnes, Logc and Quantum Physcs"

More information

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching)

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching) Face Recognton Problem Face Verfcaton Problem Face Verfcaton (1:1 matchng) Querymage face query Face Recognton (1:N matchng) database Applcaton: Access Control www.vsage.com www.vsoncs.com Bometrc Authentcaton

More information

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms Internatonal Journal of Appled Informaton Systems (IJAIS) ISSN : 2249-0868 Foundaton of Computer Scence FCS, New York, USA Volume 7 No.7, August 2014 www.jas.org Cluster Analyss of Data Ponts usng Parttonng

More information

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State

More information

SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS

SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS SCHEDULING OF CONSTRUCTION PROJECTS BY MEANS OF EVOLUTIONARY ALGORITHMS Magdalena Rogalska 1, Wocech Bożeko 2,Zdzsław Heduck 3, 1 Lubln Unversty of Technology, 2- Lubln, Nadbystrzycka 4., Poland. E-mal:rogalska@akropols.pol.lubln.pl

More information

Single and multiple stage classifiers implementing logistic discrimination

Single and multiple stage classifiers implementing logistic discrimination Sngle and multple stage classfers mplementng logstc dscrmnaton Hélo Radke Bttencourt 1 Dens Alter de Olvera Moraes 2 Vctor Haertel 2 1 Pontfíca Unversdade Católca do Ro Grande do Sul - PUCRS Av. Ipranga,

More information

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ). REVIEW OF RISK MANAGEMENT CONCEPTS LOSS DISTRIBUTIONS AND INSURANCE Loss and nsurance: When someone s subject to the rsk of ncurrng a fnancal loss, the loss s generally modeled usng a random varable or

More information

Performance Analysis and Coding Strategy of ECOC SVMs

Performance Analysis and Coding Strategy of ECOC SVMs Internatonal Journal of Grd and Dstrbuted Computng Vol.7, No. (04), pp.67-76 http://dx.do.org/0.457/jgdc.04.7..07 Performance Analyss and Codng Strategy of ECOC SVMs Zhgang Yan, and Yuanxuan Yang, School

More information

Master s Thesis. Configuring robust virtual wireless sensor networks for Internet of Things inspired by brain functional networks

Master s Thesis. Configuring robust virtual wireless sensor networks for Internet of Things inspired by brain functional networks Master s Thess Ttle Confgurng robust vrtual wreless sensor networks for Internet of Thngs nspred by bran functonal networks Supervsor Professor Masayuk Murata Author Shnya Toyonaga February 10th, 2014

More information

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification Lecture 4: More classfers and classes C4B Machne Learnng Hlary 20 A. Zsserman Logstc regresson Loss functons revsted Adaboost Loss functons revsted Optmzaton Multple class classfcaton Logstc Regresson

More information

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

Implementations of Web-based Recommender Systems Using Hybrid Methods

Implementations of Web-based Recommender Systems Using Hybrid Methods Internatonal Journal of Computer Scence & Applcatons Vol. 3 Issue 3, pp 52-64 2006 Technomathematcs Research Foundaton Implementatons of Web-based Recommender Systems Usng Hybrd Methods Janusz Sobeck Insttute

More information

Estimating the Number of Clusters in Genetics of Acute Lymphoblastic Leukemia Data

Estimating the Number of Clusters in Genetics of Acute Lymphoblastic Leukemia Data Journal of Al Azhar Unversty-Gaza (Natural Scences), 2011, 13 : 109-118 Estmatng the Number of Clusters n Genetcs of Acute Lymphoblastc Leukema Data Mahmoud K. Okasha, Khaled I.A. Almghar Department of

More information

L10: Linear discriminants analysis

L10: Linear discriminants analysis L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss

More information

An interactive system for structure-based ASCII art creation

An interactive system for structure-based ASCII art creation An nteractve system for structure-based ASCII art creaton Katsunor Myake Henry Johan Tomoyuk Nshta The Unversty of Tokyo Nanyang Technologcal Unversty Abstract Non-Photorealstc Renderng (NPR), whose am

More information

Improved SVM in Cloud Computing Information Mining

Improved SVM in Cloud Computing Information Mining Internatonal Journal of Grd Dstrbuton Computng Vol.8, No.1 (015), pp.33-40 http://dx.do.org/10.1457/jgdc.015.8.1.04 Improved n Cloud Computng Informaton Mnng Lvshuhong (ZhengDe polytechnc college JangSu

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information

1 Example 1: Axis-aligned rectangles

1 Example 1: Axis-aligned rectangles COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 6 Scrbe: Aaron Schld February 21, 2013 Last class, we dscussed an analogue for Occam s Razor for nfnte hypothess spaces that, n conjuncton

More information

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

On the Optimal Control of a Cascade of Hydro-Electric Power Stations On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;

More information

Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem

Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME, ISSUE, FEBRUARY ISSN 77-866 Logcal Development Of Vogel s Approxmaton Method (LD- An Approach To Fnd Basc Feasble Soluton Of Transportaton

More information

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Research Note APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES * Iranan Journal of Scence & Technology, Transacton B, Engneerng, ol. 30, No. B6, 789-794 rnted n The Islamc Republc of Iran, 006 Shraz Unversty "Research Note" ALICATION OF CHARGE SIMULATION METHOD TO ELECTRIC

More information

Study on Model of Risks Assessment of Standard Operation in Rural Power Network

Study on Model of Risks Assessment of Standard Operation in Rural Power Network Study on Model of Rsks Assessment of Standard Operaton n Rural Power Network Qngj L 1, Tao Yang 2 1 Qngj L, College of Informaton and Electrcal Engneerng, Shenyang Agrculture Unversty, Shenyang 110866,

More information

Mining Multiple Large Data Sources

Mining Multiple Large Data Sources The Internatonal Arab Journal of Informaton Technology, Vol. 7, No. 3, July 2 24 Mnng Multple Large Data Sources Anmesh Adhkar, Pralhad Ramachandrarao 2, Bhanu Prasad 3, and Jhml Adhkar 4 Department of

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan support vector machnes.

More information

The OC Curve of Attribute Acceptance Plans

The OC Curve of Attribute Acceptance Plans The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4

More information

Gender Classification for Real-Time Audience Analysis System

Gender Classification for Real-Time Audience Analysis System Gender Classfcaton for Real-Tme Audence Analyss System Vladmr Khryashchev, Lev Shmaglt, Andrey Shemyakov, Anton Lebedev Yaroslavl State Unversty Yaroslavl, Russa vhr@yandex.ru, shmaglt_lev@yahoo.com, andrey.shemakov@gmal.com,

More information

Data Visualization by Pairwise Distortion Minimization

Data Visualization by Pairwise Distortion Minimization Communcatons n Statstcs, Theory and Methods 34 (6), 005 Data Vsualzaton by Parwse Dstorton Mnmzaton By Marc Sobel, and Longn Jan Lateck* Department of Statstcs and Department of Computer and Informaton

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

A Simple Approach to Clustering in Excel

A Simple Approach to Clustering in Excel A Smple Approach to Clusterng n Excel Aravnd H Center for Computatonal Engneerng and Networng Amrta Vshwa Vdyapeetham, Combatore, Inda C Rajgopal Center for Computatonal Engneerng and Networng Amrta Vshwa

More information

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems

An Analysis of Central Processor Scheduling in Multiprogrammed Computer Systems STAN-CS-73-355 I SU-SE-73-013 An Analyss of Central Processor Schedulng n Multprogrammed Computer Systems (Dgest Edton) by Thomas G. Prce October 1972 Techncal Report No. 57 Reproducton n whole or n part

More information

Abstract. Clustering ensembles have emerged as a powerful method for improving both the

Abstract. Clustering ensembles have emerged as a powerful method for improving both the Clusterng Ensembles: {topchyal, Models jan, of punch}@cse.msu.edu Consensus and Weak Parttons * Alexander Topchy, Anl K. Jan, and Wllam Punch Department of Computer Scence and Engneerng, Mchgan State Unversty

More information

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

How To Understand The Results Of The German Meris Cloud And Water Vapour Product Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller

More information

Mining Feature Importance: Applying Evolutionary Algorithms within a Web-based Educational System

Mining Feature Importance: Applying Evolutionary Algorithms within a Web-based Educational System Mnng Feature Importance: Applyng Evolutonary Algorthms wthn a Web-based Educatonal System Behrouz MINAEI-BIDGOLI 1, and Gerd KORTEMEYER 2, and Wllam F. PUNCH 1 1 Genetc Algorthms Research and Applcatons

More information

1. Measuring association using correlation and regression

1. Measuring association using correlation and regression How to measure assocaton I: Correlaton. 1. Measurng assocaton usng correlaton and regresson We often would lke to know how one varable, such as a mother's weght, s related to another varable, such as a

More information

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence 1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh

More information

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic Lagrange Multplers as Quanttatve Indcators n Economcs Ivan Mezník Insttute of Informatcs, Faculty of Busness and Management, Brno Unversty of TechnologCzech Republc Abstract The quanttatve role of Lagrange

More information

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008 Rsk-based Fatgue Estmate of Deep Water Rsers -- Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn

More information

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION Vson Mouse Saurabh Sarkar a* a Unversty of Cncnnat, Cncnnat, USA ABSTRACT The report dscusses a vson based approach towards trackng of eyes and fngers. The report descrbes the process of locatng the possble

More information

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing A Replcaton-Based and Fault Tolerant Allocaton Algorthm for Cloud Computng Tork Altameem Dept of Computer Scence, RCC, Kng Saud Unversty, PO Box: 28095 11437 Ryadh-Saud Araba Abstract The very large nfrastructure

More information

BERNSTEIN POLYNOMIALS

BERNSTEIN POLYNOMIALS On-Lne Geometrc Modelng Notes BERNSTEIN POLYNOMIALS Kenneth I. Joy Vsualzaton and Graphcs Research Group Department of Computer Scence Unversty of Calforna, Davs Overvew Polynomals are ncredbly useful

More information

FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES

FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES Zuzanna BRO EK-MUCHA, Grzegorz ZADORA, 2 Insttute of Forensc Research, Cracow, Poland 2 Faculty of Chemstry, Jagellonan

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information

Efficient Project Portfolio as a tool for Enterprise Risk Management

Efficient Project Portfolio as a tool for Enterprise Risk Management Effcent Proect Portfolo as a tool for Enterprse Rsk Management Valentn O. Nkonov Ural State Techncal Unversty Growth Traectory Consultng Company January 5, 27 Effcent Proect Portfolo as a tool for Enterprse

More information

ERP Software Selection Using The Rough Set And TPOSIS Methods

ERP Software Selection Using The Rough Set And TPOSIS Methods ERP Software Selecton Usng The Rough Set And TPOSIS Methods Under Fuzzy Envronment Informaton Management Department, Hunan Unversty of Fnance and Economcs, No. 139, Fengln 2nd Road, Changsha, 410205, Chna

More information

How To Calculate The Accountng Perod Of Nequalty

How To Calculate The Accountng Perod Of Nequalty Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.

More information

Lecture 2: Single Layer Perceptrons Kevin Swingler

Lecture 2: Single Layer Perceptrons Kevin Swingler Lecture 2: Sngle Layer Perceptrons Kevn Sngler kms@cs.str.ac.uk Recap: McCulloch-Ptts Neuron Ths vastly smplfed model of real neurons s also knon as a Threshold Logc Unt: W 2 A Y 3 n W n. A set of synapses

More information

Luby s Alg. for Maximal Independent Sets using Pairwise Independence

Luby s Alg. for Maximal Independent Sets using Pairwise Independence Lecture Notes for Randomzed Algorthms Luby s Alg. for Maxmal Independent Sets usng Parwse Independence Last Updated by Erc Vgoda on February, 006 8. Maxmal Independent Sets For a graph G = (V, E), an ndependent

More information

A Comparative Study of Data Clustering Techniques

A Comparative Study of Data Clustering Techniques A COMPARATIVE STUDY OF DATA CLUSTERING TECHNIQUES A Comparatve Study of Data Clusterng Technques Khaled Hammouda Prof. Fakhreddne Karray Unversty of Waterloo, Ontaro, Canada Abstract Data clusterng s a

More information

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Internatonal Journal of Electronc Busness Management, Vol. 3, No. 4, pp. 30-30 (2005) 30 THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Yu-Mn Chang *, Yu-Cheh

More information

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL

More information

Design and Development of a Security Evaluation Platform Based on International Standards

Design and Development of a Security Evaluation Platform Based on International Standards Internatonal Journal of Informatcs Socety, VOL.5, NO.2 (203) 7-80 7 Desgn and Development of a Securty Evaluaton Platform Based on Internatonal Standards Yuj Takahash and Yoshm Teshgawara Graduate School

More information

A Secure Password-Authenticated Key Agreement Using Smart Cards

A Secure Password-Authenticated Key Agreement Using Smart Cards A Secure Password-Authentcated Key Agreement Usng Smart Cards Ka Chan 1, Wen-Chung Kuo 2 and Jn-Chou Cheng 3 1 Department of Computer and Informaton Scence, R.O.C. Mltary Academy, Kaohsung 83059, Tawan,

More information

Tools for Privacy Preserving Distributed Data Mining

Tools for Privacy Preserving Distributed Data Mining Tools for Prvacy Preservng Dstrbuted Data Mnng hrs lfton, Murat Kantarcoglu, Jadeep Vadya Purdue Unversty Department of omputer Scences 250 N Unversty St West Lafayette, IN 47907-2066 USA (clfton, kanmurat,

More information

Product Quality and Safety Incident Information Tracking Based on Web

Product Quality and Safety Incident Information Tracking Based on Web Product Qualty and Safety Incdent Informaton Trackng Based on Web News 1 Yuexang Yang, 2 Correspondng Author Yyang Wang, 2 Shan Yu, 2 Jng Q, 1 Hual Ca 1 Chna Natonal Insttute of Standardzaton, Beng 100088,

More information

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting Causal, Explanatory Forecastng Assumes cause-and-effect relatonshp between system nputs and ts output Forecastng wth Regresson Analyss Rchard S. Barr Inputs System Cause + Effect Relatonshp The job of

More information

Planning for Marketing Campaigns

Planning for Marketing Campaigns Plannng for Marketng Campagns Qang Yang and Hong Cheng Department of Computer Scence Hong Kong Unversty of Scence and Technology Clearwater Bay, Kowloon, Hong Kong, Chna (qyang, csch)@cs.ust.hk Abstract

More information

A novel Method for Data Mining and Classification based on

A novel Method for Data Mining and Classification based on A novel Method for Data Mnng and Classfcaton based on Ensemble Learnng 1 1, Frst Author Nejang Normal Unversty;Schuan Nejang 641112,Chna, E-mal: lhan-gege@126.com Abstract Data mnng has been attached great

More information

Rank Based Clustering For Document Retrieval From Biomedical Databases

Rank Based Clustering For Document Retrieval From Biomedical Databases Jayanth Mancassamy et al /Internatonal Journal on Computer Scence and Engneerng Vol.1(2), 2009, 111-115 Rank Based Clusterng For Document Retreval From Bomedcal Databases Jayanth Mancassamy Department

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

Using Supervised Clustering Technique to Classify Received Messages in 137 Call Center of Tehran City Council

Using Supervised Clustering Technique to Classify Received Messages in 137 Call Center of Tehran City Council Usng Supervsed Clusterng Technque to Classfy Receved Messages n 137 Call Center of Tehran Cty Councl Mahdyeh Haghr 1*, Hamd Hassanpour 2 (1) Informaton Technology engneerng/e-commerce, Shraz Unversty (2)

More information

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

v a 1 b 1 i, a 2 b 2 i,..., a n b n i. SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are

More information
