TAXONOMIC EVIDENCE APPLYING ALGORITHMS OF INTELLIGENT DATA MINING. ASTEROIDS FAMILIES


Gregorio Perichinsky(1), Magdalena Servente(2), Arturo Carlos Servetto(1), Ramón García Martínez(3,2), Rosa Beatriz Orellana(4), Angel Luis Plastino(5)

(1) Databases and Operating System Laboratory, Computer Science Department, School of Engineering, University of Buenos Aires. Paseo Colón Nº th Floor South Wing, (1063) Buenos Aires, Argentina. Phone: (54 11) (int. 140/145). FAX: (54 1)
(2) Intelligent System Laboratory, Computer Science Department, School of Engineering, University of Buenos Aires. Paseo Colón Nº th Floor South Wing, (1063) Buenos Aires, Argentina. Phone: (54 11) (int. 140/145). FAX: (54 1)
(3) Software Engineering & Knowledge Engineering Center (CAPIS), Graduate School, Buenos Aires Institute of Technology. Madero 399, (1106) Buenos Aires, Argentina. Phone: (5411). FAX: (54 1) ext 277
(4) Mechanics Laboratory, Celestial Mechanics Department, School of Astronomical and Geophysical Sciences, University of La Plata. Paseo del Bosque, (1900) La Plata, Buenos Aires, Argentina. Phone: (54 221)
(5) PROTEM Laboratory, Department of Physical Sciences, School of Sciences, University of La Plata. C.C. 727 or (115 # 48/49), (1900) La Plata, Buenos Aires, Argentina. Phone: (54 221) (54 221) (ext. 247)

KEYWORDS: classification, cluster (family), spectrum, induction, divide and rule, entropy.

ABSTRACT

Numerical Taxonomy aims to group operational taxonomic units (OTUs, taxons or taxa) into clusters by numerical methods, using so-called structure analysis. Obtaining the clusters that constitute families was the purpose of this latest series of projects. Structural analysis, based on phenotypic characteristics, exhibits the relationships, in terms of degrees of similarity, between two or more OTUs. Entities formed by dynamic domains of attributes change according to taxonomical requirements: the classification of objects to form families. Taxonomic objects are represented by applying the semantics of the Dynamic Relational Database Model.
Families of OTUs are obtained employing as tools i) the Euclidean distance and ii) nearest-neighbor techniques. Thus taxonomic evidence is gathered so as to quantify the similarity for each pair of OTUs (pair-group method) obtained from the basic data matrix. The main contribution up to now is the introduction of the concept of the spectrum of the OTUs, based on the states of their characters. The concept of family spectra emerges if the superposition principle is applied to the spectra of the OTUs, and the groups are delimited through the maximum of the Bienaymé-Tchebycheff relation, which determines invariants (centroid, variance and radius). A new taxonomic criterion is thereby formulated. An astronomical application is worked out. The result is a new criterion for the classification of asteroids in the hyperspace of orbital proper elements. Thus, a new approach to Computational Taxonomy is presented, which has already been employed with reference to Data Mining. This paper analyses the application of Machine Learning techniques to Data Mining. We focused our interest on the TDIDT (Top Down Induction Trees) family of induction from preclassified data, and in particular on the ID3 and C4.5 algorithms created by Quinlan. We tried to determine the degree of efficiency achieved by the algorithms of the TDIDT family when applied in data mining to generate valid models of the data in classification problems with the Gain of Entropy. Informatics (Data Mining and Computational Taxonomy) is always the original objective of our research.

1. Introduction

Taxonomic objects are here represented by applying the semantics of the Dynamic Relational Database Model: classification of objects to form families or clusters [1]. Families of OTUs are obtained employing as tools i) the Euclidean distance and ii) nearest-neighbor techniques. Thus taxonomic evidence is gathered so as to quantify the similarity for each pair of OTUs (pair-group method) obtained from the basic data matrix [2][3][4]. The main contribution of the series of papers presented until now was to introduce the concept of the spectrum of the OTUs, based on the states of their characters. The concept of family spectra emerges if the superposition principle is applied to the spectra of the OTUs, and the groups are delimited through the maximum of the Bienaymé-Tchebycheff relation, which determines invariants (centroid, variance and radius) [1]. By applying the integrated, independent domain technique dynamically to compute the Matrix of Similarity, and by recourse to an iterative algorithm, families or clusters are obtained. A new taxonomic criterion was thereby formulated.

The considerable discrepancies and incongruities among existing classifications resulting from astrophysical studies have motivated an interdisciplinary program of research that observes a clustering of asteroids in stabilized families [5]. In our case, we work in an interdisciplinary way across Celestial Mechanics [5], Information Theory [6][7], Neural Networks [8], Dynamic Databases [1] and the algorithmics of Numerical Taxonomy [2][4], in order to uncover the depths of the structure formation of the Solar System. An astronomical application is worked out. The result is a new criterion for the classification of asteroids in the hyperspace of orbital proper elements. Thus, a new approach to Computational Taxonomy is presented, which has already been employed with reference to Data Mining.

On the other hand: (i) the work of [1] has clarified subtle points concerning the long-term dynamic evolution of asteroid orbits, whose modeling is an essential prerequisite for deriving the proper elements (for the classification into families); and (ii) the availability of physical data on sizes, shapes, numerical taxonomy and rotation velocity for many hundreds of asteroids has prompted new analyses of families [1].
While the most populous families appear in quite homogeneous form under both criteria, the criterion based on composition and on physical and cosmochemical precedents identifies families with some difficulty, and the criterion that has identified families with the least difficulty is the one that uses data from celestial mechanics. We do not consider the transformation of isotropic and homogeneous sets, in which the values of the eccentricity and the semiaxis are changed to recompute the inter-gap zones of the asteroid belt as average velocities, nor the elimination of groups of 5 or fewer objects; we regard all of these as lying outside a computational criterion.

1.1 Intelligent Data Mining Introduction

Machine Learning is the field dedicated to the development of computational methods underlying learning processes and to applying computer-based learning systems to practical problems. Data Mining tries to solve problems related to the search for interesting patterns and important regularities in large databases [9] [[10]..[15]]. Data Mining uses methods and strategies from other areas, including Machine Learning. When we apply Machine Learning techniques to solve a Data Mining problem, we refer to it as Intelligent Data Mining.

This paper analyses the TDIDT (Top Down Induction Trees) induction family, and in particular the C4.5 algorithm [13b][14]. We tried to determine the degree of efficiency achieved by the C4.5 algorithm when applied in data mining to generate valid models of the data in classification problems with the Gain of Entropy. The C4.5 algorithm generates decision trees and decision rules from preclassified data. The divide and rule method is used to build the decision trees. This method divides the input data into subsets according to some pre-established criteria. Then it works on each of these subsets, dividing them again, until all the cases present in one subset belong to the same class.

2. Constructing the decision trees

2.1. ID3

The ID3 (Induction Decision Trees) algorithm was developed as a supervised learning method for building decision trees from a set of examples.
The examples must have a group of attributes and a class. The attributes and classes must be discrete, and the classes must be disjoint. The first versions of this algorithm allowed just two classes: positive and negative. This restriction was eliminated in later releases, but the disjoint classes restriction was preserved. The descriptions generated by ID3 cover each one of the examples in the training set.

2.2. C4.5

The C4.5 algorithm is a descendant of the ID3 algorithm and solves many of its predecessor's limitations. For example, C4.5 works with continuous attributes by dividing the possible results into two branches: one for values A <= N and another for A > N. Moreover, the trees are less bushy because each leaf covers a distribution of classes and not one class in particular, as in ID3 trees; this makes the trees shallower and more understandable [13b][14]. C4.5 generates a decision tree by partitioning the data recursively, following a depth-first strategy. Before making each partition, the system analyses all the possible tests that can divide the data set and selects the test with the highest information gain or the highest gain ratio. For discrete attributes, it considers a test with n possible outcomes, n being the number of possible values that the attribute can take. For continuous attributes, a binary test is performed on each of the values that the attribute can take.
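As a minimal sketch of the binary test on a continuous attribute described above, the following Python fragment enumerates candidate thresholds N and splits the cases into the A <= N and A > N branches. The function names, the attribute name "e" and the sample cases are hypothetical and chosen only for illustration; the selection of the best threshold by gain or gain ratio is not shown here.

```python
def binary_splits(values):
    """Return candidate thresholds N for tests of the form A <= N versus A > N.

    One candidate is produced per distinct observed value except the largest,
    because splitting at the maximum would leave the A > N branch empty.
    """
    distinct = sorted(set(values))
    return distinct[:-1]


def apply_split(cases, attribute, threshold):
    """Partition cases (dicts) into the A <= N branch and the A > N branch."""
    low = [c for c in cases if c[attribute] <= threshold]
    high = [c for c in cases if c[attribute] > threshold]
    return low, high


# Hypothetical cases: an eccentricity value with an illustrative family label.
cases = [
    {"e": 0.05, "family": "A"},
    {"e": 0.07, "family": "A"},
    {"e": 0.15, "family": "B"},
    {"e": 0.19, "family": "B"},
]

for n in binary_splits([c["e"] for c in cases]):
    low, high = apply_split(cases, "e", n)
    print(f"e <= {n}: {len(low)} cases, e > {n}: {len(high)} cases")
```

Each candidate threshold yields one binary test; a full implementation would score every candidate and keep the best one.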
2.3. Decision trees

The TDIDT trees, to which those generated by ID3 and later by C4.5 belong, are built with Hunt's method. The ID3 and C4.5 algorithms use the divide and rule strategy to build the initial decision tree from the training data [16]. To build a decision tree from a set $T$ of training data, this method divides the data at each step according to the values of the best attribute. Any test that divides $T$ in a non-trivial manner, leaving at least two of the subsets $\{T_i\}$ non-empty, serves this purpose. Let the classes be $\{C_1, C_2, \ldots, C_k\}$. When $T$ contains cases belonging to several classes, the idea is to refine $T$ into subsets of cases that are, or seem to be heading toward, collections of cases belonging to a single class. A test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes $\{O_1, O_2, \ldots, O_n\}$. $T$ is partitioned into the subsets $T_1, T_2, \ldots, T_n$, where $T_i$ contains all the cases of $T$ that have outcome $O_i$ for the chosen test. The decision tree for $T$ consists of a decision node identifying the test, with one branch for each possible outcome. The tree-construction mechanism is applied recursively to each subset of training data, so that the $i$-th branch leads to the decision tree built from the subset $T_i$ of training data.

Still, the ultimate objective behind the process of constructing the decision tree is not just to find any decision tree, but to find one that reveals a certain structure of the domain, that is to say, a tree with predictive power. That is why each leaf must cover a large number of cases, and why each partition must have the smallest possible number of classes. In an ideal case, we would like to choose at each step the test that generates the smallest decision tree. Basically, what we are looking for is a small decision tree consistent with the training data. We could explore and analyze all the possible decision trees and choose the simplest one. However, the hypothesis space to be searched contains an exponential number of trees that would have to be explored.
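The recursive divide and rule construction described in this section can be sketched in Python as follows. This is an illustrative outline under simplifying assumptions (discrete attributes only, no pruning), not the C4.5 implementation; `build_tree`, `first_attribute` and the sample data are hypothetical names, and the `choose` argument stands in for whatever attribute-selection criterion is in use.

```python
from collections import Counter


def build_tree(cases, attributes, choose, target="class"):
    """Recursively partition `cases` (a list of dicts) until each subset is pure.

    `choose(cases, attributes)` returns the attribute to test at this node.
    """
    classes = Counter(c[target] for c in cases)
    # Leaf: a single class remains, or no attribute is left to test.
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]

    best = choose(cases, attributes)
    remaining = [a for a in attributes if a != best]

    # One branch per outcome O_i of the test; T_i holds the matching cases.
    subsets = {}
    for c in cases:
        subsets.setdefault(c[best], []).append(c)

    return {best: {outcome: build_tree(t_i, remaining, choose, target)
                   for outcome, t_i in subsets.items()}}


def first_attribute(cases, attributes):
    """Placeholder chooser: take the attributes in the order given."""
    return attributes[0]


# Hypothetical training set T with two discrete attributes.
T = [
    {"zone": "inner", "size": "small", "class": "C1"},
    {"zone": "inner", "size": "large", "class": "C1"},
    {"zone": "outer", "size": "small", "class": "C2"},
    {"zone": "outer", "size": "large", "class": "C1"},
]

tree = build_tree(T, ["zone", "size"], first_attribute)
print(tree)
```

Replacing `first_attribute` with a criterion based on information gain or gain ratio would turn this outline into a greedy TDIDT learner; the greedy choice is what avoids the exhaustive search over all trees.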
The problem of finding the smallest decision tree consistent with the training data is NP-complete. To calculate which attribute is the best to divide the data at each step, both the information gain and the gain ratio were used. Moreover, the trees generated with the C4.5 algorithm were pruned according to the method; this post-pruning was done in order to avoid overfitting the data.

Transforming decision trees to decision rules

Decision trees that are too big or too bushy are somewhat difficult to read and understand, because each node must be interpreted in the context defined by the previous branches. In any decision tree, the conditions that must be satisfied when classifying a case can be found by following a trail from the root to the leaf to which that case belongs. If that trail were transformed directly into a production rule, the antecedent of the rule would be the conjunction of all the tests in the nodes that must be traversed to reach the leaf. All the antecedents of the rules built this way are mutually exclusive and exhaustive. To transform a tree into decision rules, the C4.5 algorithm traverses the decision tree in preorder (from the root to the leaves, from left to right) and constructs a rule for each path from the root to a leaf. The rule's antecedent is the conjunction of the value tests belonging to each of the visited nodes, and the class is the one corresponding to the leaf reached.

Evaluation of the TDIDT family

We used a cross-validation approach to evaluate the decision trees and the production rules obtained. Each dataset was divided into two sets in proportions 2:3 and 1:3: we used two thirds of the original data as a training set and one third to evaluate the results. We expressed the results of these tests in a confusion matrix, where each class had two values associated with it: the number of examples classified correctly and the number of examples classified as belonging to another class.

3. Requirements engineering

Hirayama

Examining the distribution of the asteroids with respect to their orbital elements, in particular the principal (mean) motion, the inclination and the eccentricity, condensations are observed in different places that seem random, but in some cases, taking only the probabilities into account, this is not so evident [1]. The asteroids are also grouped by having nearby inclinations, or orbital planes with practically the same pole (that of the orbit of Jupiter); other groupings do not share the same center, but when the eccentricity and the longitude of the perihelion are plotted instead of the inclination and the longitude of the node, the distribution has the shape of a circumference. Continuing the development of this theory, there is no doubt that physical relationships connect the asteroids. Because of this we can venture that associated asteroid families exist. The theory remains verified, and so do the families formed, such as KORONIS (fhn 158), EOS (fhn 221), THEMIS (fhn 24), FLORA (fhn 244), MARIA (fhn 170) and PHOCAEA (fhn 25), where fhn is the family head number.

The distribution of orbital elements in the asteroid belt is not random and shows the existence of families, that is, groups of asteroids whose semi-major axis, eccentricity and inclination (or its sine) cluster around certain special values, following Arnold (about 1969 there were fewer than 1735 objects) [1]. The agglomeration into families (clustering) has been verified by correcting the periodic perturbations produced by the secular variations caused by the major planets, such as Jupiter, that is, by taking the proper elements. Other groupings have been identified by their own resonance characteristics, as streams of impelled asteroids (JET STREAMS) through the FLORA family, and as objects that cross MARS on orbits of higher eccentricity.

Celestial bodies are classified on the basis of physical attributes, of the phenotypic characteristics (characters or attributes) of the asteroids and, finally, of their genotypic or common origin. The vicinity condition should be taken into account; the high-density families are the most stable and the least random. The Hirayama families are confirmed; the small families are of low density, the probability of belonging to them is high, and therefore their coupling by the pair-group method is possible.

About 1982, Carusi and Valsecchi recorded 2125 minor planets of asteroid type, whose groupings produce discrepancies among the results of the computational classification methods based on physical and dynamical parameters [1]. This discrepancy among the statistical methods is disconcerting, since the relationship among the members of a family with respect to the dynamical parameters and any physical study carried out on them should concur. It can be observed that the growth in observations does not resolve the discrepancies. Among the family identification methods, the discrepancies arise from their probabilistic criteria, and with the future discovery of new asteroids there seems to be a contradiction between them. In spite of all this, if there is congruity, the suspected families appear in reality (scientific method of contrast); but if the methods are arbitrary, they are always debatable, in addition to the methodological doubt [the authors]. Williams had already discussed Arnold's problem in terms of his criterion of a uniform Poissonian density distribution and the proper elements.
In the 1980s, analysis techniques based on similarity and a generalized distance, but relying on personal judgement or manual handling, were usual, rather than automatic classification. For this reason the variance ($\sigma_j$) of the domains and families comes to be considered in the process of identifying elements within the family or the subsequent ones. The accepted classes have been split into two types: 1) the class has been identified in two intervals without noticeable differences, and 2) the class was found mixed, coupling with other less important classes in overlapping intervals, so that masked families or less reliable contours may exist; these aspects should emerge from a proper statistical method.

These projects of the Jet Propulsion Laboratory, California Institute of Technology, yielded orbits crossing those of the major planets, split into families by the characteristics of the method. One characteristic is that strong resonance does not appear in asteroids and weak resonance is taken as noise. The distances are taken from a straight line SUN-PLANET (Mars MXR, Jupiter JXR, Saturn SXR, etc.), and the proper values are more exact within the belt than outside it (something which endorses the theory of the authors).

For Knezevic and Milani, the proper elements obtained from a second-order analytical theory, for asteroids identified in the principal belt (main belt), are much more exact than those of eccentricity and small inclination in the region of the Themis family. This is because the short-period perturbations are eliminated and the principal second-order effects are taken into account, according to the results of an algorithm consistent with the modern dynamical theories of Kolmogorov-Arnold-Moser; about 3495 asteroids from the Leningrad edition of the Ephemerides of the Minor Planets are involved. Hildas, Trojans and objects near the Earth (q < 1.1 AU) were discarded. All this development appears less clear and arbitrary: there is no formal basis for the relationship between the convergence, the number of iterations (quality code QC) and the number of asteroids.
The criterion of Zappala, Cellino, Farinella and Knezevic (1992 and subsequent) is important, since it gave an improved classification of asteroids into dynamical families by analyzing a database of numbered asteroids whose proper elements had been computed with a new second-order, fourth-degree secular perturbation theory and whose long-term stability had been verified. The multivariate criterion uses the technique of hierarchical clustering data analysis. It was applied to build, for each zone of the asteroid belt, a "dendrogram" graph in the proper elements space, with a distance function related to the incremental velocity needed for the orbital change after ejection from the fragmented parent body. The significance parameters associated with each family, measured against random concentrations (transforming the anisotropic and inhomogeneous zones into homogeneous and isotropic inter-gap zones of the asteroid belt by modifying mechanical attributes such as the semi-major axis and the inclination), and the robustness (stability) parameters were obtained by repeating the classification procedure after varying the velocity elements by small amounts, recomputing the real zones with artificially changed coefficients of the distance function.

The most important and robust families are, as usual, Themis, Eos and Koronis, which jointly include 14% of the known main-belt population; but 12 more reliable and robust families were found throughout the belt, the majority departing partially from previous classifications. This is the case of FLORA, in the region of the inner belt, where reliable family identification is very difficult, mainly because the density is high and the accuracy of the proper inclinations and eccentricities is poor, chiefly on account of the proximity of a strong secular resonance. In this way 21 families were established with a genuinely significant and fully automated method.

Spectral analysis classification criterion

We have decided to carry out, with our spectral analysis criterion, the classification into families extended to the database of asteroid proper elements [1]. We recognize that the works of Zappala are very important (automatic classification and hierarchical method) and a turning point in the early 90s, but our approach is different because we work in computational taxonomy, in a taxonomic hyperspace, and not with a criterion based on composition and on physical and cosmochemical precedents. Zappala uses a confusing methodology, with only one velocity variable, which transforms a homogeneous space into an inhomogeneous one and conversely, in a way that is not clearly univocal.

We thus incorporate an updated and larger set of osculating elements derived from the secular perturbation theory, whose accuracy (specifically, the stability over time) has been extensively verified by long-term numerical integration. The procedure is automatic; it does not prejudge the data analysis by imposing non-random groups in the proper elements space, as the criterion of Zappala does; and it quantifies the statistical significance of these groups, with statistics for the important families that are robust with respect to small random variations of the proper elements. All of this is based on an analysis in Computational Taxonomy. We do not consider the transformation of isotropic and homogeneous sets, in which the values of the eccentricity and the semiaxis are changed to recompute the inter-gap zones of the asteroid belt as average velocities, nor the elimination of groups of 5 or fewer objects; we regard all of these as lying outside a computational criterion. Thus, a new approach to Computational Taxonomy is presented, which has already been employed with reference to Data Mining.

Numerical Taxonomy
We infer an analogy of the taxonomic representation [1] in a dynamic relational database. We explain the theoretical development of a database structured by domains and how it can be represented in a Dynamic Database. We then apply our model to the structural aspects of the taxonomy, applying scaling methods for domains [2][4]. We define the numerical methods used for establishing and defining clusters by their taxonomic distances. We shall let $C_{jk}$ stand for a general dissimilarity coefficient, of which the taxonomic distance $d_{jk}$ is a special example. Euclidean distances will be used in the explanation of clustering techniques. In discussing clustering procedures we make a useful distinction between three types of measure. We use space-conserving or space-distorting clustering strategies, in which the space in the immediate vicinity of a cluster appears to have been contracted or dilated; the criterion of admission for a candidate joining an extant cluster is constant in all pair-group methods [1]. Thus we can represent the data matrix and compute the resemblance of normalized domains. The steps of clustering are the recomputation of the similarity coefficient for future admissions, followed by the admission criterion for new members of an established cluster.

Dispersion

Once a typical value of the variable of the states of the characters is known, it is necessary to have a parameter that gives an idea of how scattered or concentrated the values are with respect to the mean value [19]. The variance is considered as a moment of second order and represents the moment of inertia of the distribution of objects ("mass") with respect to their center of gravity, the centroid. The normalized variable [2]

$$ X'_j = \frac{X_j - \bar{X}_j}{\sigma_j} $$

represents the deviation of $X_j$ with respect to its mean $\bar{X}_j$ in units of $\sigma_j$.
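As a minimal sketch of the normalization and of the Euclidean taxonomic distance discussed here, the following Python fragment standardizes each character of a small data matrix and computes the distance $d_{jk}$ between every pair of OTUs. The function names and the sample values are hypothetical; the sketch assumes the population standard deviation and that no character is constant (which would give $\sigma_j = 0$).

```python
import math
from statistics import mean, pstdev


def normalize(matrix):
    """Standardize each character (column): (X_j - mean_j) / sigma_j."""
    columns = list(zip(*matrix))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # assumes every sigma_j > 0
    return [[(x - mu) / s for x, mu, s in zip(row, mus, sigmas)]
            for row in matrix]


def distance_matrix(matrix):
    """Euclidean distance d_jk between every pair of OTUs (rows)."""
    return [[math.dist(a, b) for b in matrix] for a in matrix]


# Hypothetical basic data matrix: one row per OTU, one column per character.
data = [
    [2.2, 0.10, 0.05],
    [2.3, 0.12, 0.07],
    [3.1, 0.07, 0.02],
    [3.2, 0.05, 0.03],
]

z = normalize(data)
d = distance_matrix(z)
for row in d:
    print([round(x, 3) for x in row])
```

After normalization every character has zero mean and unit variance, so no single character dominates the distance merely because of its units.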
The normalization of the states of the characters makes the average of every character zero and its variance one. If we take the variance $\sigma_d^2$ as the value of the dispersion, we express the principle of least squares. Let $g(X_j)$ be a non-negative function of the variable $X_j$; then for all $K > 0$ the probability satisfies:

$$ P\left( g(X_j) \ge K \right) \le \frac{E\left[ g(X_j) \right]}{K} $$

If $g(X_j) = (X_j - \bar{X}_j)^2$ and $K = k^2 \sigma_j^2$, we obtain for all $k > 0$ the Bienaymé-Tchebycheff inequality:

$$ P\left( |X_j - \bar{X}_j| \ge k\,\sigma_j \right) \le \frac{1}{k^2} $$

This inequality shows that the quantity of mass (OTUs) of the distribution located outside the interval $\bar{X}_j - k\,\sigma_j < X_j < \bar{X}_j + k\,\sigma_j$ is at most $1/k^2$, which gives an idea of the use of $\sigma_j$ as a measure of dispersion or concentration.

Clusters and Spectra

In discussing Sequential, Agglomerative, Hierarchic and Nonoverlapping (SAHN) [4] clustering procedures we make a useful distinction between the three types of measure. We shall be concerned with clusters J, K and L containing $t_j$, $t_k$ and $t_l$ OTUs, respectively, where $t_j$, $t_k$ and $t_l$ are all $\ge 1$. OTUs j and k are contained in clusters J and K, respectively, and $l \in L$. Given two clusters J and K that are to be joined, the problem is to evaluate the dissimilarity between the resulting joint cluster and additional candidates L for further fusion. The fused cluster is denoted (J,K), with $t_{j,k} = t_j + t_k$ OTUs.

The cluster center or centroid represents an average object, which is simply a mathematical construct that permits the characterization of the density, the variance, the taxon radius and the range as INVARIANT quantities. The states of the taxonomic characters in a class, defined ordinarily with reference to the set of their properties, allow one to calculate the distances between the members of the class. The distances can be established by the similarity relationship among individuals (obtaining a computed matrix of similarity). Considering characteristic spectra [1], in addition to the states of the characters or attributes of the OTUs, we introduce here the new SPECTRAL concepts of i) OBJECT and ii) FAMILY SPECTRA. Within the taxonomic space this method of clustering delimits taxonomic groups in such a manner that they can be visualized as characteristic spectra of an OTU and characteristic spectra of the families.

We define an individual spectral metric for the set of distances between an OTU and the other OTUs of the set. Each one is provided by the states of the characters and is therefore constant for each OTU if the taxonomic conditions do not change (in analogy with phasors), giving an individual taxonomic spectrum (ITS). The spectrum of taxonomic similarity is the set of distances between the OTUs of the set that determine the constant characteristics of a cluster or family, for a given type of taxonomic conditions. Invariants are found that characterize each cluster. Among them we mention the variance, the radius, the density and the centroid.
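The invariants and the individual taxonomic spectrum mentioned above can be illustrated with a short Python sketch. The text does not give explicit formulas, so the definitions used here are assumptions made for illustration only: the variance is taken as the mean squared Euclidean distance of the members to the centroid, the radius as the largest such distance, and the spectrum of an OTU as the sorted list of its distances to the other OTUs. All function names and data are hypothetical.

```python
import math


def centroid(cluster):
    """Average object of the cluster: the mean of each character."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def invariants(cluster):
    """Centroid, variance and radius of a cluster (assumed definitions)."""
    c = centroid(cluster)
    dists = [math.dist(p, c) for p in cluster]
    variance = sum(d * d for d in dists) / len(cluster)
    radius = max(dists)
    return {"centroid": c, "variance": variance, "radius": radius}


def individual_spectrum(index, otus):
    """Distances from OTU `index` to every other OTU, in ascending order."""
    return sorted(math.dist(otus[index], o)
                  for k, o in enumerate(otus) if k != index)


# Hypothetical family of four OTUs in a two-character taxonomic space.
family = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

print(invariants(family))
print(individual_spectrum(0, family))
```

As long as the states of the characters do not change, these quantities stay fixed, which is the sense in which they act as invariants of the family.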
These invariants are associated with the spectra of taxonomic similarity that identify each family.

Tests of Intelligent Data Mining

A software system was constructed to evaluate the C4.5 algorithm. This system takes the training data as input and allows the user to choose whether to construct a decision tree according to C4.5. If the user chooses C4.5, the decision tree is generated, then pruned, and the decision rules are built. The decision tree and the ruleset generated by C4.5 are evaluated separately from each other. We used the system to test the algorithms in different domains, mainly Elita, a database of asteroids.

Computing the Information Gain

When the set T contains examples belonging to different classes, a test is performed on the different attributes and a partition is made according to the "best" attribute. To find the "best" attribute, information theory is used, which holds that information is maximized when entropy is minimized. The entropy determines the randomness or disorder of a set. Suppose that we have negative and positive examples. In this context the entropy of the subset $S$, $H(S)$, can be calculated as:

$$ H(S) = -p^{+} \log p^{+} - p^{-} \log p^{-} \qquad (3.4.1) $$

where $p^{+}$ is the probability that an example taken at random from $S$ is positive. This probability may be calculated as

$$ p^{+} = \frac{n^{+}}{n^{+} + n^{-}} \qquad (3.4.2) $$

where $n^{+}$ is the number of positive examples in $S$ and $n^{-}$ the number of negative examples. The probability $p^{-}$ is calculated analogously to $p^{+}$, replacing the number of positive examples by the number of negative examples, and conversely. Generalizing expression (3.4.1) to any type of examples, we obtain the general formulation of the entropy:

$$ H(S) = -\sum_{i=1}^{n} p_i \log p_i \qquad (3.4.3) $$

In all the calculations related to the entropy, we define $0 \log 0$ equal to 0.
If the attribute $at$ divides the set $S$ into the subsets $S_i$, $i = 1, 2, \ldots, n$, then the total entropy of the system of subsets will be:

$$ H(S, at) = \sum_{i=1}^{n} P(S_i)\, H(S_i) \qquad (3.4.4) $$

where $H(S_i)$ is the entropy of the subset $S_i$ and $P(S_i)$ is the probability that an example belongs to $S_i$. It can be calculated, using the relative sizes of the subsets, as:

$$ P(S_i) = \frac{|S_i|}{|S|} \qquad (3.4.5) $$

The gain of information may be calculated as the decrease in entropy. Thus:

$$ I(S, at) = H(S) - H(S, at) \qquad (3.4.6) $$
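Expressions (3.4.3) to (3.4.6) can be sketched in Python as follows. This is an illustrative outline, not the code of the evaluated system; the function names, the `target` key and the sample cases are hypothetical, and the logarithm is assumed to be base 2, which the text does not state.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i.

    Classes with zero count never appear in the Counter, which is
    consistent with the convention 0 log 0 = 0.
    """
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_entropy(cases, attribute, target="class"):
    """H(S, at) = sum_i P(S_i) H(S_i), with P(S_i) = |S_i| / |S|."""
    subsets = {}
    for c in cases:
        subsets.setdefault(c[attribute], []).append(c[target])
    return sum(len(s) / len(cases) * entropy(s) for s in subsets.values())


def information_gain(cases, attribute, target="class"):
    """I(S, at) = H(S) - H(S, at)."""
    return entropy([c[target] for c in cases]) - split_entropy(cases, attribute, target)


# Hypothetical set S: "zone" separates the classes perfectly, "size" not at all.
S = [
    {"zone": "inner", "size": "small", "class": "C1"},
    {"zone": "inner", "size": "large", "class": "C1"},
    {"zone": "outer", "size": "small", "class": "C2"},
    {"zone": "outer", "size": "large", "class": "C2"},
]

for at in ("zone", "size"):
    print(at, information_gain(S, at))
```

In this example the attribute with the larger gain is "zone", so a greedy tree builder using this criterion would test it first.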
In expression (3.4.6), $H(S)$ is the value of the entropy a priori, before carrying out the subdivision, and $H(S, at)$ is the value of the entropy of the system of subsets generated by the partition according to $at$. The use of entropy to evaluate the best attribute is not the only method that exists or is used in Machine Learning. However, it is the one used by Quinlan in developing ID3 and its successor C4.5.

Numerical Data

Decision trees can be generated from discrete attributes as well as from continuous attributes. When working with discrete attributes, partitioning the set according to the value of an attribute is simple. With continuous attributes it is not, because each case may take a different value. To solve this problem, the binary method can be used. This method consists in forming two ranges of values according to the value of an attribute, which can then be taken as symbolic.

4. Results and Conclusions

Results of the C4.5

C4.5 with post-pruning results in smaller and less bushy trees. If we analyze the trees obtained in the domain, we see that the percentages of error obtained with C4.5 are between 3% and 3.7%, since C4.5 generates smaller trees and smaller rulesets, a consequence of the fact that each leaf in a generated tree covers a distribution of classes.

[Figure: error percentage for the domain ELITA. Series: [1] C4.5 gain, trees; [2] C4.5 gain, rules; [3] C4.5 proportion of gain, trees; [4] C4.5 proportion of gain, rules. Value shown: < 3%.]

From the analysis of this value we could conclude that no method generates a clearly superior model for the domain. On the contrary, we could state that the error percentage does not appear to depend on the method used, but on the analyzed domain.

Hypothesis space

The hypothesis space for this algorithm is complete with respect to the available attributes. Because any value test can be represented with a decision tree, this algorithm avoids one of the principal risks of inductive methods that work by reducing the hypothesis space. An important feature of the C4.5 algorithm is that it uses all the available data at each step to choose the best attribute; this decision is made with statistical methods.
This favors the algorithm over others, because it analyzes how the input dataset is represented in consistent form as decision trees. Once an attribute has been selected as a decision node, the algorithm does not go back over its choices. This is the reason why the algorithm can converge to a local maximum [20]. The C4.5 algorithm adds a certain degree of reconsideration of its choices in the post-pruning of the decision trees. Nevertheless, the results show that the proportion of error depends on the data domain. For future study, we suggest analyzing the input datasets with the numerical method of clustering and choosing for the domain the method that maintains a low percentage error in extended databases, as a measure of the robustness of the method.

5. Corollary

From what has been said, the work uses the Sequential, Agglomerative, Hierarchic and Nonoverlapping clustering procedures, the spectral analysis criterion and invariants to accomplish classifications of asteroid proper elements in extended databases, in order to structure families. Preclassified data are an important input to Intelligent Data Mining, and Computational Taxonomy in Databases will always have a low percentage error in extended databases as a robustness of the method; combining the two gives a reliable result.

References

[1] Perichinsky, G., Orellana, R., Plastino, A.L., Jimenez Rey, E. and Grossi, M.D. "Spectra of Taxonomic Evidence in Databases." Proceedings of XVIII International Conference on Applied Informatics. (Paper ). Innsbruck, Austria.
[2] Crisci, J.V., Lopez Armengol, M.F. "Introduction to Theory and Practice of the Numerical Taxonomy", A.S.O. Regional Program of Science and Technology for Development. Washington D.C. (in Spanish).
[3] Gennari, J.H. A Survey of Clustering Methods (b). Technical Report. Department of Computer Science and Informatics, University of California, Irvine, CA.
[4] Sokal, R.R., Sneath, P.H.A. "Numerical Taxonomy". W.H. Freeman and Company.
[5] Zappala, V., Cellino, A., Farinella, P., Milani, A. The Astronomical Journal, 107,
[6] Abramson, N. Information Theory and Coding. McGraw Hill. Paraninfo. Madrid.
[7] Hamming, R.W. Coding and information theory. Englewood Cliffs, NJ: Prentice Hall.
[8] Freeman, J.A., Skapura, D.M. Neural Networks. Algorithms, applications and techniques of programming. Addison Wesley Iberoamericana (in Spanish).
[9] Michalski, R.S. A Theory and Methodology of Inductive Learning. In Michalski, R.S., Carbonell, J.G., Mitchell, T.M. (1983) Machine Learning: An Artificial Intelligence Approach, Vol. I. Morgan Kaufmann, USA.
[10] Quinlan, J.R. Induction of Decision Trees. In Machine Learning, Ch. 1, p. Morgan Kaufmann.
[11] Quinlan, J.R. Generating Production Rules from Decision trees. Proceeding of the Tenth International Joint