Cluster-based Sampling Approaches to Imbalanced Data Distributions for PAKDD 2007 Competition
Yue-Shi Lee and Show-Jane Yen
Department of Computer Science and Information Engineering, Ming Chuan University
5 The-Ming Rd., Gwei Shan District, Taoyuan County 333, Taiwan
{sjyen,leeys}@mcu.edu.tw

Abstract. For classification problems, the training data significantly influence the classification accuracy. When the data set is highly unbalanced, classification algorithms tend to degenerate by assigning all cases to the most common outcome. Hence, it is important to select suitable training data for classification in the imbalanced class distribution problem. In this paper, we propose cluster-based under-sampling approaches for selecting representative data as training data to improve the classification accuracy in an imbalanced class distribution environment, i.e., the PAKDD competition data set. The CART (Classification and Regression Tree) classification algorithm is considered. The experimental results show that our cluster-based under-sampling approaches can outperform the traditional approaches.

1 Introduction

Classification techniques usually assume that the training samples are uniformly distributed between different classes. A classifier performs well when the classification technique is applied to a dataset evenly distributed among different classes. However, many datasets in real applications involve the imbalanced class distribution problem [5, 7]. The imbalanced class distribution problem occurs when there are many more samples in one class than in the other class in a training dataset. In an imbalanced dataset, the majority class has a large percentage of all the samples, while the samples in the minority class occupy just a small part of all the samples. In this case, a classifier usually tends to predict that samples have the majority class and completely ignore the minority class.

One simple method of under-sampling is to select a subset of the majority class samples (MA) randomly and then combine them with the minority class samples (MI) as a training set, which is called the random under-sampling approach.
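The random under-sampling approach just described can be illustrated with a minimal sketch. The function name, the `ratio` and `seed` parameters, and the toy data are assumptions made for illustration only; they do not come from the paper.

```python
import random


def random_under_sample(majority, minority, ratio=1.0, seed=0):
    """Keep about ratio * len(minority) randomly chosen majority samples.

    The `ratio` and `seed` parameters and the list-based data layout are
    illustrative assumptions, not part of the original description.
    """
    rng = random.Random(seed)
    # Never request more majority samples than are available.
    n_keep = min(len(majority), round(ratio * len(minority)))
    selected = rng.sample(majority, n_keep)  # sampling without replacement
    return selected + list(minority)


# Toy data: 1000 majority (MA) and 100 minority (MI) samples.
MA = [("MA", i) for i in range(1000)]
MI = [("MI", i) for i in range(100)]
train = random_under_sample(MA, MI, ratio=1.0)
print(len(train))  # 200: 100 sampled MA samples plus all 100 MI samples
```

Because the subset is drawn blindly, nothing guarantees that the kept majority samples are representative, which is the weakness that informed selection tries to address.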
Several more advanced approaches have been proposed to make the selected samples more representative. The distance-based under-sampling approach [7] uses four distinct modes, namely the nearest, the farthest, the average nearest, and the average farthest distances between MI and MA, as four standards for selecting representative samples from MA. For every minority class sample in the dataset, the first method, nearest, calculates the distances between all majority class samples and the minority class
samples, and selects the k majority class samples which have the smallest distances to the minority class sample. If there are n minority class samples in the dataset, the nearest approach would finally select k × n majority class samples (k ≥ 1). However, some of the selected majority class samples might be duplicates. Similar to the nearest approach, the farthest approach selects the majority class samples which have the farthest distances to each minority class sample. For every majority class sample in the dataset, the third method, average nearest, calculates the average distance between that majority class sample and all minority class samples. This approach selects the majority class samples which have the smallest average distances. The last method, average farthest, is similar to the average nearest approach; it selects the majority class samples which have the farthest average distances to all the minority class samples. The above distance-based under-sampling approaches in [7] spend a lot of time selecting the majority class samples in a large dataset, and they are not efficient in real applications.

In 2003, J. Zhang and I. Mani [6] presented a comparison of four informed under-sampling approaches and the random under-sampling approach. The first method, NearMiss-1, selects the majority class samples which are close to some minority class samples. In this method, majority class samples are selected when their average distances to the three closest minority class samples are the smallest. The second method, NearMiss-2, selects the majority class samples whose average distances to the three farthest minority class samples are the smallest. The third method, NearMiss-3, takes out a given number of the closest majority class samples for each minority class sample. Finally, the fourth method, Most distant, selects the majority class samples whose average distances to the three closest minority class samples are the largest. The final experimental results in [6] showed that the NearMiss-2 approach and the random under-sampling approach perform the best.
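To make the average nearest and average farthest rules concrete, here is a minimal sketch that ranks majority samples by their average distance to all minority samples. It is not the implementation used in [7]; the function names, the `farthest` flag, and the two-dimensional toy points are assumptions for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def select_by_average_distance(majority, minority, n_select, farthest=False):
    """Rank majority samples by average distance to all minority samples.

    farthest=False keeps the smallest averages ("average nearest");
    farthest=True keeps the largest averages ("average farthest").
    """
    def avg_dist(x):
        return sum(euclidean(x, y) for y in minority) / len(minority)

    ranked = sorted(majority, key=avg_dist, reverse=farthest)
    return ranked[:n_select]


# Hypothetical toy points.
MI = [(0.0, 0.0), (1.0, 0.0)]
MA = [(0.5, 0.1), (5.0, 5.0), (0.4, 2.0), (9.0, 9.0)]
print(select_by_average_distance(MA, MI, 2))                 # [(0.5, 0.1), (0.4, 2.0)]
print(select_by_average_distance(MA, MI, 2, farthest=True))  # [(9.0, 9.0), (5.0, 5.0)]
```

Every majority sample is compared with every minority sample, so the cost grows with the product of the two class sizes, which is consistent with the efficiency concern raised above for large datasets.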
In this paper, we study the effects of under-sampling [1, 3, 6] on the CART (Classification and Regression Tree) classification algorithm and propose some new under-sampling approaches based on clustering, such that the influence of imbalanced class distribution can be decreased and the accuracy of predicting the minority class can be increased.

2 Our Approaches

In this section, we present our approach SBC (under-Sampling Based on Clustering), which focuses on the under-sampling approach and uses clustering techniques to solve the imbalanced class distribution problem. Our approach first clusters all the training samples into some clusters. The main idea is that there are different clusters in a dataset, and each cluster seems to have distinct characteristics. If a cluster has more majority class samples and fewer minority class samples, it will behave like the majority class samples. On the contrary, if a cluster has more minority class samples and fewer majority class samples, it doesn't hold the characteristics of the majority class samples and behaves more like the minority class samples. Therefore, our approach SBC selects a suitable number of majority class samples from each cluster by
considering the ratio of the number of majority class samples to the number of minority class samples in the cluster.

2.1 Under-sampling based on clustering

Assume that the number of samples in the class-imbalanced dataset is N, which includes majority class samples (MA) and minority class samples (MI). The size of the dataset is the number of the samples in this dataset. The size of MA is represented as $\mathit{Size}_{MA}$, and $\mathit{Size}_{MI}$ is the number of samples in MI. In the class-imbalanced dataset, $\mathit{Size}_{MA}$ is far larger than $\mathit{Size}_{MI}$. For our under-sampling method SBC, we first cluster all samples in the dataset into K clusters. The number of majority class samples and the number of minority class samples in the i-th cluster ($1 \le i \le K$) are $\mathit{Size}_{MA}^{i}$ and $\mathit{Size}_{MI}^{i}$, respectively. Therefore, the ratio of the number of majority class samples to the number of minority class samples in the i-th cluster is $\mathit{Size}_{MA}^{i} / \mathit{Size}_{MI}^{i}$. If the ratio of $\mathit{Size}_{MA}$ to $\mathit{Size}_{MI}$ in the training dataset is set to be m:1, the number of selected majority class samples in the i-th cluster is shown in expression (1):

$$\mathit{SSize}_{MA}^{i} = (m \times \mathit{Size}_{MI}) \times \frac{\mathit{Size}_{MA}^{i} / \mathit{Size}_{MI}^{i}}{\sum_{i=1}^{K} \mathit{Size}_{MA}^{i} / \mathit{Size}_{MI}^{i}} \qquad (1)$$

In expression (1), $m \times \mathit{Size}_{MI}$ is the total number of selected majority class samples that we intend to have in the final training dataset. $\sum_{i=1}^{K} \mathit{Size}_{MA}^{i} / \mathit{Size}_{MI}^{i}$ is the total ratio of the number of majority class samples to the number of minority class samples in all clusters. Expression (1) determines that more majority class samples would be selected in the cluster which behaves more like the majority class samples. In other words, $\mathit{SSize}_{MA}^{i}$ is larger when the i-th cluster has more majority class samples and fewer minority class samples. After determining the number of majority class samples to be selected in the i-th cluster, $1 \le i \le K$, by using expression (1), we randomly choose majority class samples in the i-th cluster. The total number of selected majority class samples is $m \times \mathit{Size}_{MI}$ after merging all the selected majority class samples in each cluster.
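Expression (1) and the random selection step can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the rounding of each quota to the nearest integer, the cap at the cluster's majority count, and the error raised for a cluster without minority samples are all assumptions, since the text does not specify these details. The cluster sizes are hypothetical.

```python
import random


def sbc_allocation(cluster_counts, m=1.0):
    """Expression (1): majority samples to select from each cluster.

    cluster_counts is a list of (size_ma_i, size_mi_i) pairs, one per cluster.
    Returns real-valued quotas; rounding is left to the caller.
    """
    if any(mi == 0 for _, mi in cluster_counts):
        # Assumption: the text does not say how to treat such clusters.
        raise ValueError("every cluster needs at least one minority sample")
    total_mi = sum(mi for _, mi in cluster_counts)    # Size_MI
    ratios = [ma / mi for ma, mi in cluster_counts]   # Size_MA^i / Size_MI^i
    ratio_sum = sum(ratios)
    return [m * total_mi * r / ratio_sum for r in ratios]


def sbc_sample(clusters, m=1.0, seed=0):
    """clusters is a list of (majority_samples, minority_samples) pairs."""
    rng = random.Random(seed)
    counts = [(len(ma), len(mi)) for ma, mi in clusters]
    selected = []
    for (ma, _), quota in zip(clusters, sbc_allocation(counts, m)):
        k = min(len(ma), round(quota))  # rounding and capping are assumptions
        selected.extend(rng.sample(ma, k))
    minority = [s for _, mi in clusters for s in mi]
    return selected + minority


# Hypothetical clusters with (majority, minority) sizes.
counts = [(60, 10), (30, 10), (30, 30)]
print(sbc_allocation(counts))  # [30.0, 15.0, 5.0]

clusters = [
    ([("MA", c, j) for j in range(ma)], [("MI", c, j) for j in range(mi)])
    for c, (ma, mi) in enumerate(counts)
]
print(len(sbc_sample(clusters)))  # 100: 50 selected MA samples plus 50 MI samples
```

With m = 1 the quotas sum to the total number of minority samples (50 here), matching the statement that the merged selection contains m × Size_MI majority samples.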
Finally, we combine all the minority class samples with the selected majority class samples to construct a new training dataset. Table 1 shows the steps for our under-sampling approach.

For example, assume that an imbalanced class distribution dataset has 1100 samples in total. The size of MA is 1000 and the size of MI is 100. In this example, we cluster this dataset into three clusters. Table 2 shows the number of majority class
samples $\mathit{Size}_{MA}^{i}$, the number of minority class samples $\mathit{Size}_{MI}^{i}$, and the ratio of $\mathit{Size}_{MA}^{i}$ to $\mathit{Size}_{MI}^{i}$ for the i-th cluster.

Table 1. The structure of the under-sampling based on clustering approach SBC

Step 1. Determine the ratio of $\mathit{Size}_{MA}$ to $\mathit{Size}_{MI}$ in the training dataset.
Step 2. Cluster all the samples in the dataset into some clusters.
Step 3. Determine the number of selected majority class samples in each cluster by using expression (1), and then randomly select the majority class samples in each cluster.
Step 4. Combine the selected majority class samples and all the minority class samples to obtain the training dataset.

Table 2. Cluster descriptions

Cluster ID   Number of majority class samples   Number of minority class samples   $\mathit{Size}_{MA}^{i} / \mathit{Size}_{MI}^{i}$
1            500                                10                                 500/10 = 50
2            300                                50                                 300/50 = 6
3            200                                40                                 200/40 = 5

Assume that the ratio of $\mathit{Size}_{MA}$ to $\mathit{Size}_{MI}$ in the training data is set to be 1:1; in other words, there are 100 selected majority class samples and all 100 minority class samples in this training dataset. The number of selected majority class samples in each cluster can be calculated by expression (1). Table 3 shows the number of selected majority class samples in each cluster. We finally select the majority samples randomly from each cluster and combine them with the minority samples to form the new dataset.

Table 3. The number of selected majority class samples in each cluster

Cluster ID   The number of selected majority class samples
1            1 × 100 × 50 / (50+6+5) ≈ 82
2            1 × 100 × 6 / (50+6+5) ≈ 10
3            1 × 100 × 5 / (50+6+5) ≈ 8

2.2 Under-sampling based on clustering and distances

In the SBC method, all the samples are clustered into several clusters and the number of selected majority class samples is determined by expression (1). Finally, the majority class samples are randomly selected from each cluster. In this section, we propose two other under-sampling methods, which are based on the SBC approach. The difference between the two proposed under-sampling methods and the SBC method is the way the majority class samples are selected from each cluster. For the two proposed methods, the majority class samples are selected according to the distances between
the majority class samples and the minority class samples in each cluster. Hence, the distances between samples need to be computed. For a continuous attribute, the values of all samples for this attribute need to be normalized in order to avoid the effect of different scales for different attributes. For example, suppose A is a continuous attribute. In order to normalize the values of attribute A for all the samples, we first find the maximum value $\mathit{Max}_A$ and the minimum value $\mathit{Min}_A$ of A over all samples. To make an attribute value a lie between 0 and 1, a is normalized to $\frac{a - \mathit{Min}_A}{\mathit{Max}_A - \mathit{Min}_A}$. For a categorical or discrete attribute, the distance between two attribute values $x_1$ and $x_2$ is 1 (i.e., $x_1 - x_2 = 1$) when $x_1$ is not equal to $x_2$, and the distance is 0 (i.e., $x_1 - x_2 = 0$) when they are the same.

Assume that there are N attributes in a dataset and $V_i^X$ represents the value of attribute $A_i$ in sample X, for $1 \le i \le N$. The Euclidean distance between two samples X and Y is shown in expression (2):

$$\mathit{distance}(X, Y) = \sqrt{\sum_{i=1}^{N} \left(V_i^X - V_i^Y\right)^2} \qquad (2)$$

The two approaches we propose in this section also first cluster all samples into K (K ≥ 1) clusters and determine the number of selected majority class samples for each cluster by expression (1). For each cluster, the representative majority class samples are selected in different ways. The first method, SBCMD (Sampling Based on Clustering with Most Distant), selects the majority class samples whose average distances to the M closest minority class samples in the i-th cluster are the farthest. The second method, called SBCMF (Sampling Based on Clustering with Most Far), selects the majority class samples whose average distances to all minority class samples in the cluster are the farthest.

3 Data Preprocessing

The PAKDD competition data set contains much noise, such as null or missing values and wrong values. Therefore, we need to perform data preprocessing. First of all, we compute the null value distribution for all attributes.
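Computing a null value distribution can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (a list of dictionaries), the set of tokens treated as null, the 0.5 cut-off, and the toy values are all hypothetical, and the paper does not state which threshold was used.

```python
def null_fraction(rows, null_tokens=(None, "", "NULL")):
    """Fraction of null values per attribute; rows is a list of dicts."""
    columns = rows[0].keys()
    return {
        c: sum(r.get(c) in null_tokens for r in rows) / len(rows)
        for c in columns
    }


# Hypothetical records; the values are made up for illustration.
rows = [
    {"AGE_AT_APPLICATION": 34, "AMEX_CARD": None},
    {"AGE_AT_APPLICATION": 51, "AMEX_CARD": None},
    {"AGE_AT_APPLICATION": None, "AMEX_CARD": "Y"},
    {"AGE_AT_APPLICATION": 28, "AMEX_CARD": None},
]
fractions = null_fraction(rows)
print(fractions)  # {'AGE_AT_APPLICATION': 0.25, 'AMEX_CARD': 0.75}

THRESHOLD = 0.5  # assumed cut-off, not taken from the paper
dropped = [c for c, f in fractions.items() if f > THRESHOLD]
print(dropped)  # ['AMEX_CARD']
```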
After this step, we remove 6 attributes (AMEX_CARD, DINERS_CARD, VISA_CARD, MASTERCARD, RETAIL_CARDS, DISP_INCOME_CODE) from the training data set due to their high percentage of null values. Then, we encode the attribute values for several attributes to get reasonable values. For example, we encode the attribute value 98 to 0 and 99 to 12 in the B_ENQ_L1M attribute. After encoding each attribute, we calculate the attribute significance for each attribute and finally select only 14 attributes for training our classification model. They are listed below.
Attribute             Description
B_ENQ_L6M_GR3         Number of Bureau Enquiries in the Last 6 Months for Mortgages
B_ENQ_L12M_GR3        Number of Bureau Enquiries in the Last 12 Months for Mortgages
CURR_RES_MTHS         Total Number of Months at Current Residence
AGE_AT_APPLICATION    Customer's Age (in Years) upon Application
B_ENQ_L3M             Number of Bureau Enquiries in the Last 3 Months
B_ENQ_L1M             Number of Bureau Enquiries in the Last 1 Month
B_ENQ_L6M             Number of Bureau Enquiries in the Last 6 Months
CURR_EMPL_MTHS        Total Number of Months at Current Employment
B_ENQ_L12M_GR2        Number of Bureau Enquiries in the Last 12 Months for Loans
B_ENQ_L6M_GR2         Number of Bureau Enquiries in the Last 6 Months for Loans
PREV_RES_MTHS         Total Number of Months at Previous Address
NBR_OF_DEPENDANTS     Number of Dependents
B_ENQ_L6M_GR2C        Number of Bureau Enquiries in the Last 6 Months for Loans
ANNUAL_INCOME_RANGE   Number of Bureau Enquiries in the Last 6 Months

4 Experimental Results

We conducted several experiments, and the best one is listed below.

Fig. 1 Experimental Results
5 Conclusion

In a classification task, the effect of the imbalanced class distribution problem is often ignored. Many studies [2, 4] focused on improving the classification accuracy but did not consider the imbalanced class distribution problem. Hence, the classifiers constructed in these studies lose the ability to predict the correct decision class for the minority class samples in datasets in which the number of majority class samples is much greater than the number of minority class samples. Many real applications, like rarely-seen disease investigation, credit card fraud detection, and internet intrusion detection, always involve the imbalanced class distribution problem. It is hard to make the right predictions on the customers or patients that we are interested in. In this study, we propose cluster-based under-sampling approaches to solve the imbalanced class distribution problem by using CART (Classification and Regression Tree). The experimental results show that our cluster-based under-sampling approaches can outperform the traditional approaches.

References

1. Chawla, N. V.: C4.5 and Imbalanced Datasets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure. Proceedings of the ICML'03 Workshop on Class Imbalances (2003).
2. Caragea, D., Cook, D., Honavar, V.: Gaining Insights into Support Vector Machine Pattern Classifiers Using Projection-Based Tour Methods. Proceedings of the KDD Conference, San Francisco, CA (2001).
3. Drummond, C., Holte, R. C.: C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling. Proceedings of the ICML'03 Workshop on Learning from Imbalanced Datasets (2003).
4. del-Hoyo, R., Buldain, D., Marco, A.: Supervised Classification with Associative SOM. Lecture Notes in Computer Science, Vol. 2686 (2003).
5. Japkowicz, N.: Concept-learning in the Presence of Between-class and Within-class Imbalances. Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence (2001).
6. Zhang, J., Mani, I.: kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets (2003).
7. Chyi, Y.M.: Classification Analysis Techniques for Skewed Class Distribution Problems, Master Thesis, Department of Information Management, National Sun Yat-Sen University (2003).
More informationSingle and multiple stage classifiers implementing logistic discrimination
Sngle and multple stage classfers mplementng logstc dscrmnaton Hélo Radke Bttencourt 1 Dens Alter de Olvera Moraes 2 Vctor Haertel 2 1 Pontfíca Unversdade Católca do Ro Grande do Sul - PUCRS Av. Ipranga,
More information14.74 Lecture 5: Health (2)
14.74 Lecture 5: Health (2) Esther Duflo February 17, 2004 1 Possble Interventons Last tme we dscussed possble nterventons. Let s take one: provdng ron supplements to people, for example. From the data,
More informationAn interactive system for structure-based ASCII art creation
An nteractve system for structure-based ASCII art creaton Katsunor Myake Henry Johan Tomoyuk Nshta The Unversty of Tokyo Nanyang Technologcal Unversty Abstract Non-Photorealstc Renderng (NPR), whose am
More informationFREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES
FREQUENCY OF OCCURRENCE OF CERTAIN CHEMICAL CLASSES OF GSR FROM VARIOUS AMMUNITION TYPES Zuzanna BRO EK-MUCHA, Grzegorz ZADORA, 2 Insttute of Forensc Research, Cracow, Poland 2 Faculty of Chemstry, Jagellonan
More informationPSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12
14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed
More informationProduct Quality and Safety Incident Information Tracking Based on Web
Product Qualty and Safety Incdent Informaton Trackng Based on Web News 1 Yuexang Yang, 2 Correspondng Author Yyang Wang, 2 Shan Yu, 2 Jng Q, 1 Hual Ca 1 Chna Natonal Insttute of Standardzaton, Beng 100088,
More informationBUSINESS PROCESS PERFORMANCE MANAGEMENT USING BAYESIAN BELIEF NETWORK. 0688, dskim@ssu.ac.kr
Proceedngs of the 41st Internatonal Conference on Computers & Industral Engneerng BUSINESS PROCESS PERFORMANCE MANAGEMENT USING BAYESIAN BELIEF NETWORK Yeong-bn Mn 1, Yongwoo Shn 2, Km Jeehong 1, Dongsoo
More informationTraditional versus Online Courses, Efforts, and Learning Performance
Tradtonal versus Onlne Courses, Efforts, and Learnng Performance Kuang-Cheng Tseng, Department of Internatonal Trade, Chung-Yuan Chrstan Unversty, Tawan Shan-Yng Chu, Department of Internatonal Trade,
More informationA Multi-mode Image Tracking System Based on Distributed Fusion
A Mult-mode Image Tracng System Based on Dstrbuted Fuson Ln zheng Chongzhao Han Dongguang Zuo Hongsen Yan School of Electroncs & nformaton engneerng, X an Jaotong Unversty X an, Shaanx, Chna Lnzheng@malst.xjtu.edu.cn
More informationCan Auto Liability Insurance Purchases Signal Risk Attitude?
Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang
More informationDamage detection in composite laminates using coin-tap method
Damage detecton n composte lamnates usng con-tap method S.J. Km Korea Aerospace Research Insttute, 45 Eoeun-Dong, Youseong-Gu, 35-333 Daejeon, Republc of Korea yaeln@kar.re.kr 45 The con-tap test has the
More informationChapter 6. Classification and Prediction
Chapter 6. Classfcaton and Predcton What s classfcaton? What s Lazy learners (or learnng from predcton? your neghbors) Issues regardng classfcaton and Frequent-pattern-based predcton classfcaton Classfcaton
More informationA Load-Balancing Algorithm for Cluster-based Multi-core Web Servers
Journal of Computatonal Informaton Systems 7: 13 (2011) 4740-4747 Avalable at http://www.jofcs.com A Load-Balancng Algorthm for Cluster-based Mult-core Web Servers Guohua YOU, Yng ZHAO College of Informaton
More informationGaining Insights to the Tea Industry of Sri Lanka using Data Mining
Proceedngs of the Internatonal MultConference of Engneers and Computer Scentsts 2008 Vol I Ganng Insghts to the Tea Industry of Sr Lanka usng Data Mnng H.C. Fernando, W. M. R Tssera, and R. I. Athauda
More informationPredicting Software Development Project Outcomes *
Predctng Software Development Project Outcomes * Rosna Weber, Mchael Waller, June Verner, Wllam Evanco College of Informaton Scence & Technology, Drexel Unversty 3141 Chestnut Street Phladelpha, PA 19104
More informationLoad Balancing By Max-Min Algorithm in Private Cloud Environment
Internatonal Journal of Scence and Research (IJSR ISSN (Onlne: 2319-7064 Index Coperncus Value (2013: 6.14 Impact Factor (2013: 4.438 Load Balancng By Max-Mn Algorthm n Prvate Cloud Envronment S M S Suntharam
More informationSupport vector domain description
Pattern Recognton Letters 20 (1999) 1191±1199 www.elsever.nl/locate/patrec Support vector doman descrpton Davd M.J. Tax *,1, Robert P.W. Dun Pattern Recognton Group, Faculty of Appled Scence, Delft Unversty
More informationGRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM
GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM BARRIOT Jean-Perre, SARRAILH Mchel BGI/CNES 18.av.E.Beln 31401 TOULOUSE Cedex 4 (France) Emal: jean-perre.barrot@cnes.fr 1/Introducton The
More informationExhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation
Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The
More informationFormation of probabilistic concepts through observations containing. discrete and continuous attributes.
Formaton of probablstc concepts through observatons contanng dscrete and contnuous attrbutes Rcardo Batsta Rebouças, João José Vasco Furtado Mestrado em Informátca Aplcada (MIA) Unversdade de Fortaleza
More informationDATA MINING CLASSIFICATION ALGORITHMS FOR KIDNEY DISEASE PREDICTION
DATA MINING CLASSIFICATION ALGORITHMS FOR KIDNEY DISEASE PREDICTION Dr. S. Vjayaran 1, Mr.S.Dhayanand 2, Assstant Professor 1, M.Phl Research Scholar 2, Department of Computer Scence, School of Computer
More informationSupport Vector Machines
Support Vector Machnes Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan support vector machnes.
More informationClassification of Network Traffic via Packet-Level Hidden Markov Models
Classfcaton of Network Traffc va Packet-Level Hdden Markov Models Alberto Danott, Walter de Donato, Antono Pescapè Department of Computer Scence and Systems Unversty of Naples Federco II {alberto, walter.dedonato,
More informationCustomer Segmentation Using Clustering and Data Mining Techniques
Internatonal Journal of Computer Theory and Engneerng, Vol. 5, No. 6, December 2013 Customer Segmentaton Usng Clusterng and Data Mnng Technques Kshana R. Kashwan, Member, IACSIT, and C. M. Velu fronter
More informationImproved Mining of Software Complexity Data on Evolutionary Filtered Training Sets
Improved Mnng of Software Complexty Data on Evolutonary Fltered Tranng Sets VILI PODGORELEC Insttute of Informatcs, FERI Unversty of Marbor Smetanova ulca 17, SI-2000 Marbor SLOVENIA vl.podgorelec@un-mb.s
More informationHow To Classfy Onlne Mesh Network Traffc Classfcaton And Onlna Wreless Mesh Network Traffic Onlnge Network
Journal of Computatonal Informaton Systems 7:5 (2011) 1524-1532 Avalable at http://www.jofcs.com Onlne Wreless Mesh Network Traffc Classfcaton usng Machne Learnng Chengje GU 1,, Shuny ZHANG 1, Xaozhen
More informationProperties of Indoor Received Signal Strength for WLAN Location Fingerprinting
Propertes of Indoor Receved Sgnal Strength for WLAN Locaton Fngerprntng Kamol Kaemarungs and Prashant Krshnamurthy Telecommuncatons Program, School of Informaton Scences, Unversty of Pttsburgh E-mal: kakst2,prashk@ptt.edu
More informationJ. Parallel Distrib. Comput.
J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n
More informationA Suspect Vehicle Tracking System Based on Video
3rd Internatonal Conference on Multmeda Technology ICMT 2013) A Suspect Vehcle Trackng System Based on Vdeo Yad Chen 1, Tuo Wang Abstract. Vdeo survellance systems are wdely used n securty feld. The large
More informationPerformance Management and Evaluation Research to University Students
631 A publcaton of CHEMICAL ENGINEERING TRANSACTIONS VOL. 46, 2015 Guest Edtors: Peyu Ren, Yancang L, Hupng Song Copyrght 2015, AIDIC Servz S.r.l., ISBN 978-88-95608-37-2; ISSN 2283-9216 The Italan Assocaton
More information7.5. Present Value of an Annuity. Investigate
7.5 Present Value of an Annuty Owen and Anna are approachng retrement and are puttng ther fnances n order. They have worked hard and nvested ther earnngs so that they now have a large amount of money on
More informationSemantic Content Enrichment of Sensor Network Data for Environmental Monitoring
Proceedngs of the Twenty-Seventh Internatonal Florda Artfcal Intellgence Research Socety Conference Semantc Content Enrchment of Sensor Network Data for Envronmental Montorng Dustn R. Franz and Rcardo
More informationEstimating the Number of Clusters in Genetics of Acute Lymphoblastic Leukemia Data
Journal of Al Azhar Unversty-Gaza (Natural Scences), 2011, 13 : 109-118 Estmatng the Number of Clusters n Genetcs of Acute Lymphoblastc Leukema Data Mahmoud K. Okasha, Khaled I.A. Almghar Department of
More informationStudy on Model of Risks Assessment of Standard Operation in Rural Power Network
Study on Model of Rsks Assessment of Standard Operaton n Rural Power Network Qngj L 1, Tao Yang 2 1 Qngj L, College of Informaton and Electrcal Engneerng, Shenyang Agrculture Unversty, Shenyang 110866,
More informationMachine Learning and Software Quality Prediction: As an Expert System
I.J. Informaton Engneerng and Electronc Busness, 2014, 2, 9-27 Publshed Onlne Aprl 2014 n MECS (http://www.mecs-press.org/) DOI: 10.5815/jeeb.2014.02.02 Machne Learnng and Software Qualty Predcton: As
More informationOn the Optimal Control of a Cascade of Hydro-Electric Power Stations
On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;
More informationRecurrence. 1 Definitions and main statements
Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.
More information