DATA MINING AND STATISTICAL METHODS USED FOR SCANNING CATEGORICAL DATA. Murtadha M Hamad Al-Anbar University-College of computers

Similar documents
Vol. 35, No. 3, Sept 30,2000 ملخص تعتبر الخوارزمات الجينية واحدة من أفضل طرق البحث من ناحية األداء. فبالرغم من أن استخدام هذه الطريقة ال يعطي الحل

Is it statistically significant? The chi-square test

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Fairfield Public Schools

Data Mining: An Overview. David Madigan

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

DATA MINING TECHNIQUES AND APPLICATIONS

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Testing Research and Statistical Hypotheses

Least Squares Estimation

CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS

Additional sources Compilation of sources:

Chapter 4. Probability and Probability Distributions

Evaluating the Area of a Circle and the Volume of a Sphere by Using Monte Carlo Simulation. Abd Al -Kareem I. Sheet * E.mail: kareem59y@yahoo.

DHL Data Mining Project. Customer Segmentation with Clustering

Easily Identify Your Best Customers

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Chi-square test Fisher s Exact test

Text Analytics. A business guide

Association Between Variables

Social Media Mining. Data Mining Essentials

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Statistics Graduate Courses

How To Use Neural Networks In Data Mining

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools

Information Management course

Data Mining Algorithms Part 1. Dejan Sarka

A Data Mining Study of Weld Quality Models Constructed with MLP Neural Networks from Stratified Sampled Data

Big Data: Rethinking Text Visualization

Protein Protein Interaction Networks

Simulating Chi-Square Test Using Excel

ORGANIZATIONAL KNOWLEDGE MAPPING BASED ON LIBRARY INFORMATION SYSTEM

White Paper April 2006

Section 12 Part 2. Chi-square test

RMTD 404 Introduction to Linear Models

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

CHAPTER 14 ORDINAL MEASURES OF CORRELATION: SPEARMAN'S RHO AND GAMMA

REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION

Analyzing Research Data Using Excel

Linear Models in STATA and ANOVA

An Algorithm for Data Hiding in Binary Images. Eman Th. Sedeek Al-obaidy Veterinary Medicine College University of Mosul

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Implications of Big Data for Statistics Instruction 17 Nov 2013

A Divided Regression Analysis for Big Data

Analysing Questionnaires using Minitab (for SPSS queries contact -)

Introduction to Data Mining

Chapter 20: Data Analysis

Data Mining: Overview. What is Data Mining?

Pentaho Data Mining Last Modified on January 22, 2007

Abbas S. Tavakoli, DrPH, MPH, ME 1 ; Nikki R. Wooten, PhD, LISW-CP 2,3, Jordan Brittingham, MSPH 4

How To Cluster

The Cost of Annoying Ads

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Application of Predictive Model for Elementary Students with Special Needs in New Era University

MBA Data Mining & Knowledge Discovery

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

Simple Regression Theory II 2010 Samuel L. Baker

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA

not possible or was possible at a high cost for collecting the data.

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Standardization and Its Effects on K-Means Clustering Algorithm

CLUSTER ANALYSIS. Kingdom Phylum Subphylum Class Order Family Genus Species. In economics, cluster analysis can be used for data mining.

Model-Based Cluster Analysis for Web Users Sessions

Learning is a very general term denoting the way in which agents:

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Calculating, Interpreting, and Reporting Estimates of Effect Size (Magnitude of an Effect or the Strength of a Relationship)

Survival Analysis of the Patients Diagnosed with Non-Small Cell Lung Cancer Using SAS Enterprise Miner 13.1

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Hexaware E-book on Predictive Analytics

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

A Stock Pattern Recognition Algorithm Based on Neural Networks

Comparing Methods to Identify Defect Reports in a Change Management Database

January 26, 2009 The Faculty Center for Teaching and Learning

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Prediction of Heart Disease Using Naïve Bayes Algorithm

STATISTICS FOR PSYCHOLOGISTS

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Introduction to Longitudinal Data Analysis

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Decision Support System For A Customer Relationship Management Case Study

Grid Density Clustering Algorithm

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Practical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods

Transcription:

ISSN: 1991-8941 DATA MINING AND STATISTICAL METHODS USED FOR SCANNING CATEGORICAL DATA Murtadha M Hamad Al-Anbar University-College of computers Received: 1/5/007 Accepted: 30/8/007 ABSTRACT: It has been shown that data mining uncovers patterns in data using predictive techniques. These patterns play a critical role in decision making because they reveal areas for process improvement. Statistical techniques such as Chi-square test for association are widely used in the medical field. Yet, the interpretation of some of the results approached by the use of this statistical techniques is seems to be a very difficult task. The type of association is often non-linear and hence will mask the important part of the use of this technique. In this research work a new approach is adopted by scanning the raw data for any possible association (linear or non-linear). More data mining methods and statistical inference were the base tools of this research work. Keywords: Chi-square test, Data mining, Kappa measures, Linear and non-linear association, Categorical data. 1. INTRODUCTION In the cross-classification of categorical data, researchers are always interested in the sense of searching the cross-classification table for any potential relation ship between groups of the cross classified variables. Such a relationship is statistically denoted as an association. Data mining often concerns with the meaning and quality of the information embedded in any given set of data. Ideas from information measurements and statistical analysis will be merged in order to establish a linkage between these two tools [1]. Such a linkage will allow the users to handle a straightforward interpretation as to unmask the type and degree of association, and hence will enhance the meaning of the results obtained. Most analysts separate data mining software into two groups: data mining tools and data mining applications. Data mining tools provide a number of techniques that can be applied to any business problem. Regardless of whether we are aware of them, our daily lives are influenced by data mining

applications. For example, almost every financial transaction is processed by a data mining application to detect fraud. Both data mining tools and data mining applications are valuables, however. Increasingly organizations as data mining tools and data mining applications together in a integrated environment for predictive analytic. Assume a concept of event patterns as an embodiment of information. Consider a set of mutually exclusive random variables {Xi : i = 1 k}. an instantiation of any sub set of variables in X referred to as an event pattern. In this research, light will be shed only on discrete, finite, multi-valued random variables. Using multi-valued discrete variables to represent a physical phenomenon, a concept, or an object, is common in a variety of fields such as medicine, business and economics. For example, in a medical diagnosis problem, gender and condition may be two variables of interest. A particular patient always has one and only one gender, meanwhile could be a located to none, one or more disease(s) []. The concept of data mining passed upon data patterns is to identify events patterns that are either statistically significant or not. One approach towards identifying statistical significant information is passed on event association [3]. Significant association may be determined by statistical hypothesis test passed on mutual information measure or residual analysis. as reported elsewhere, mutual information with regard to information theory is asymptotically distributed as chisquare distribution. this result has been extended elsewhere [4] to model residual analysis as a normally distributed random variable. In doing so, statistical hypothesis test passed on residual analysis may be used as a conceptual tool to discover data patterns with significant event associations. Another results discovered recently [5] is an algebraic linkage between information measure and statistical analysis that suggests yet another approach for detecting events association passed on symbol probability ratio..data Handling and Algorithm as to clarify the algorithm with real set of data that were part of the data set collected by holmquest et al [6] in an investigation into observer reliability in the histological classification of carcinoma in situ and related lesions of the uterine cervix [7], will be used (table 1). 1: negative : A typical squamous 3: Carcinoma in situ 4: Squamous carcinoma with early stromal invasion. 5: Invasive carcinoma The algorithm

Before they are plunged into the process of calculating the chisquare statistics and its relevant kappa [8] it is useful to state the gradual steps of the computer algorithm from the beginning till the handling of the decision statement. The following steps illustrate the detailed algorithm: 1. A database containing information about patients relevant to this research must be available. In this context a virtual medical database has been prepared to handle the real data considered in this research. A table containing the required in formation must be identified. In this context a virtual table has been designed to involve the following fields: - Patients number - Age - Sex - Date of admission - Complain - Symptoms - Mass - Pathologist 1 - Pathologist The table contained information about the real patients as posed by the example mentioned in everitt B. S. [6]. 3. Cases with negative mass findings has been ignored in the study. Only those cases with positive mass findings were involved. This procedure has been done by the use of filter statement available within the Microsoft database program. 4. A different procedure has been implemented to handle the cross-classification table of beliefs from pathologist 1 and (Table 1). 5. A crystal report containing the cross classified table and the results of both chi-square statistics and kappa has been designed. The report also contained a decision statement the magnitude of kappa based on the comparison of the calculated value of kappa with its theoretical range of values. The detailed procedure for calculating chi-square test and kappa is given according to the following: a. The value of Chi-sqaure statistics calculation: m k ( Oy E ) y x = i = 1 j = 1 E y where m is the number of rows and k is the number of columns for the cross classified table: O is the observed frequency and E is the expected frequency calculated by multiplying the corresponding row and column totals and divide the result by the grand total. b. Calculation kappa : According to the data of table 1, the following components are going to be calculated: n + 7+ 36+ 7+ 3 p = C u / T= i= 1 118 0 = 0.635

p 0 n ( TiTn / T ) 100 = = = % i 1 6* 7 6*1 + + 118 118 38*69 118 + *7 118 + 6*3 118 Kappa = P P O 0 1 P0 = 0.635 0.3 = 0.47 1 0.3 100 % = 0.3 Therefore In order to calculate the variance of kappa, the proportion of each entry of table 1 will be divided by 118 (table ) 1 P n u Var( Kappa) = 4 1 ( 1 0) i = T P + Sd = Var( Kappa) = 0. 06 [( 1 P ) ( P P )( 1 P )] c i j n n ( 1 P0 ) Pn ( Pi + Pj ) ( P0 Pc Pc + P0 ) i= 1 j= 1 0 = 0.0036 95% confidence intervals=kappa ±1.98sd / T =(0.33,0.61) 3.Data model In this research, a data table has been done which contains the various number of the different diagnosis cases and the number of the cases dealt with (118) as the shown in table (1). As figure 1 show: 1. Some statistical tools were used as helping devices, such as chi-square statistics and kappa metric in table (1) for obtaining table (). Then, the standard deviation was calculated.. A database under the name (pathologists.mdb) was built and designed, which contains (118) records corresponding with the cases. The data base is expandable to include further number of records. Filtering of the data base records has been carried out in order to deal with the cases of positive mass under study. The results of 1 and, and the use of the table below [8]:

Were conducive to a report containing the essential information which has a role in making the decision that leads to diagnosis of the infection level of the studied cases. 4.Discussion Results In this paper, 118 cases of factual data have been dealt with, as shown in table (1) and data base (pathologists.mdb). After examining the statistical concepts, it has been noticed that using the weighted Kappa Metric is important in satisfactorily classifying and partitioning the data groups of the above data base according to the studied cases. it has also been noticed that Kappa Metric has an active role and important indicator relative to the data observed by ( pathologists). Despite the complexities that accompany the calculation, the role of Kappa Metric is greater when the observes are more than two. From algorithm (step 5) we got the following: Sd = Var( Kappa) = 0.06 95%Confidenceinmtervals = Kappa ± 1.98Sd / T = ( 0.33,0.61) From table 3 we got the optimal value (0.47) of strength agreement: Moderate Kappa= 0.47 and hence a moderate linear association is detected This messages one of six possible that the program may revealed according to the value of Kappa as stated in table 3. The message motional below the crystal report is the actual output of the program according to the value of Kappa (0.47). The 95% confidence intervals indicated that the value of Kappa will never be out of range (0.33, 0.61). 5.Conclusion The study has reached the following conclusions: 1- Using the statistical concepts and tools as supporting tools in dealing with data mining has a role in uncovering data which are not easily revealed by normal methods. - Several interesting results are found in this research. First, the concept of data patterns allows us to visualize any early step of data mining has being a process of finding significant event associations. This process lends itself to a set of data patterns for discovering an inference model; where such a model encapsulates significant behavior of the data as measured by statistical analysis as well as information measure. 3- Data mining uncovers patterns in data using predictive techniques. These patterns play a critical role in decision making because they reveal areas for processes improvement. Using data mining, organizations can increase the profitability of their interactions with customers, detect fraud, and improve risk management. The

patterns uncovered using data mining help organizations make better and timelier decisions. 6.Refernces [1] Sy B. K., "Pattern-based Inference Approach for Data Mining", Queen College, Department of Computer Science, 1999. [] Sy B. K. & D. Sher, "An abstraction Theory framework for Probabilistic Inference", Proc. Of the workshop on spatial and temporal Interaction, Nov. 1994. [3] Goldstein D, Ghosh D, Conlon E, "Statistical issues in the clustering of gene expression data", 00, 1:19-41. [4] Wrong A.K.C., "High order pattern discovery from Discrete valued Data", IEEE Trans. On Knowledge and Data Engineering, 9(6):877-839,1997. [5] Sy B. K., "Informationtheoretical and statistical approaches for independence Test ",Proc. Of the international S-PLUS Users Conference, published by Mathsoft Inc., Washington D.C., Oct 1998. [6] N.D. Holmquist & O.D. Williams "Variabililty in classification of carcinoma in Situ of the Uterine Cervix", Archives of pathology, 1967. [7] B.S. Everitt, "statistical methods for medical Investigations", First published in great Britain 1989. [8] S. Swift, "Consensus clustering and functional interpretation of gene-expression data ",Genome Biology, Nov. 004.

Create & Design (pathologist.mdb) using DB program Create table 1 with observed frequencies Statistical processing (preprocessing) calculating Chi-square Kappa measures Obtain table Using weighted-kappa guideline Report with decision statement Report with decision statement End Figure 1: Suggested Data Model for current work

Table 1: observed frequencies of biopsy slides classified by two pathologist according to most involved lesion of the uterine cervix. Pathologist Pathologist 1 11 33 44 55 Total Ti 1 7 00 00 6 55 7 114 00 00 6 3 00 1 336 00 00 38 4 00 01 114 77 00 5 00 00 33 00 33 6 Total T 7 11 669 77 33 118 T Table : proportion of each entry as compared to the grand total. 1 3 4 5 T i 1 0.186 0.017 0.017 0.000 0.000 0.0 0.04 0.059 0.119 0.000 0.000 0.0 3 0.000 0.017 0.305 0.000 0.000 0.3 4 0.000 0.008 0.119 0.059 0.000 0.186 5 0.000 0.000 0.05 0.000 0.05 0.051 T j 0.9 0.10 0.0585 0.059 0.05 Table 3: Evaluation of observed Kappa values Weighted_kappa Strength of agreement 0.00 Poor 0.00-0.0 Slight 0.1-0.40 Fair 0.41-0.60 Moderate 0.61-0.80 Substantial 0.81-1.00 Almost perfect

الخلاصة طرق التحري والا حصاء المستخدمة في مسح البيانات الفي وية مرتضى محمد حمد E. mail : mortadha61@yahoo.com لقد تبين ان اسلوب التحري عن البيانات يكشف لنا الكثير من الانماط الغير المعروفة والتي لها دورها الهام في عملية صنع القرار. ان هذه الانماط يمكن ان تلعب دورا مهما في ازالة الستار عن المساحات التي يمكن من خلالها تحسين عملية صنع القرار. التقنية الاحصاي ية كا ختبار chi-square لتحديد نوع الاقتران كثيرة الاستعمال في الحقل الطبي. رغم ذلك تفسير البعض من النتاي ج التي جرى حسابها باستعمال هذه التقنية الاحصاي ية تبدو مهمة صعبة جدا. ان نوع الاقتران في اغلب الاحيان يكون لاخطيا ولذلك سيخفي الجزء المهم لاستعمال هذه التقنية. في هذا البحث تم التعامل باسلوب جديد من خلال مسح البيانات الاولية لاي اقتران محتمل (خطي او لاخطي). طريقة التحري عن البيانات والاستدلال الاحصاي ي تعتبر من الادوات الاساسية لهذا البحث.