A DATA MINING APPLICATION IN A STUDENT DATABASE




JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES
JULY 2005 VOLUME 2 NUMBER 2 (53-57)

Şenol Zafer ERDOĞAN
Maltepe University, Faculty of Engineering
Büyükbakkalköy-Istanbul
senole@maltepe.edu.tr

Mehpare TİMOR
İstanbul University, Faculty of Business Administration
Avcılar-Istanbul
timorm@istanbul.edu.tr

ABSTRACT

Data mining is a technology used in different disciplines to search for significant relationships among variables in large data sets. Data mining is mainly used in commercial applications. In this study, we concentrated on the application of data mining in an education environment. The relationship between students' university entrance examination results and their success was studied using cluster analysis and k-means algorithm techniques.

Keywords: Data mining, Cluster Analysis, K-Means Algorithm.

1. INTRODUCTION

The amount of data maintained in an electronic format has seen a dramatic increase in recent times. The amount of information doubles every 20 months, and the number of databases is increasing at an even greater rate [1,2]. The search to determine significant relationships among variables in the data has become a slow and subjective process. As a possible solution to this problem, the concept of Knowledge Discovery in Databases (KDD) has emerged [3]. The process of the formation of significant models and assessment within KDD is referred to as data mining [2,4]. Data mining is used to uncover hidden or unknown information that is not apparent, but potentially useful [2,5].

Cluster analysis is a technique used in data mining. Cluster analysis involves the process of grouping objects with similar characteristics [6], and each group is referred to as a cluster. Cluster analysis is used in various fields, such as marketing, image processing, geographical information systems, biology, and genetics.

In this study, university students were grouped according to their characteristics, forming clusters. The clustering process was carried out using a K-means algorithm.
2. CLUSTER ANALYSIS

Cluster analysis is a multivariate analysis technique in which individuals with similar characteristics are identified and classified (grouped) accordingly [2,7]. Through cluster analysis, dense and sparse regions can be identified in the distribution, and different distribution patterns may be discovered. Cluster analysis relies on the concepts of similarity and difference, and different measures may be used to determine them. This study utilises the Euclidean distance measure.

2.1. Euclidean Distance Measure

The Euclidean distance measure is frequently used as a distance measure, and is easy to use in two-dimensional planes. As the number of dimensions increases, the computation time also increases [2].

$$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ip} - x_{jp})^2} \qquad \text{(2.1.a)}$$
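As an illustration of formula (2.1.a), the distance between two objects can be computed as in the following minimal Python sketch. The function name `euclidean_distance` and the sample points are assumptions made for this example; the study itself was implemented in Matlab, and this is not the authors' code.

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two objects with the same number of dimensions p."""
    if len(x_i) != len(x_j):
        raise ValueError("objects must have the same number of dimensions")
    # Sum the squared differences over all p dimensions, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Hypothetical two-dimensional objects, chosen only for illustration
d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
```

The loop over `zip(x_i, x_j)` visits each dimension once, so the cost grows linearly with p, which matches the remark that computation time increases with the number of dimensions.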

Formula (2.1.a) defines the distance between data objects $i$ and $j$, each with $p$ dimensions. The distance between the two data objects is written $d(i,j)$, and $x_{ip}$ is the measurement of object $i$ in dimension $p$.

2.2. Algorithm

The K-means algorithm is a cluster analysis algorithm used as a partitioning method, and was developed by MacQueen in 1967 [8]. K-means is the most widely used and studied clustering algorithm. Given a set of $n$ data points in real $d$-dimensional space, $R^d$, and an integer $k$, the problem is to determine a set of $k$ points in $R^d$, called centers, so as to minimize the mean squared distance from each data point to its nearest center [9].

The K-means algorithm defines a random cluster centroid according to the initial parameters [8]. Each consecutive case is added to the cluster according to the proximity between the mean value of the case and the cluster centroid. The clusters are then re-analysed to determine the new centroid point. This procedure is repeated for each data object.

The algorithm is composed of the following steps [10]:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

3. APPLICATION

In this study, data gathered from university students was analysed using a k-means algorithm cluster analysis technique.

3.1. Data Set

The data gathered from the students of Maltepe University was used in this study. The data was gathered in 2003, and included records of 7 students.

3.2. Database

The database management system used in the study was Microsoft SQL Server 2000. This system was used for two reasons: the software used in the analysis was compatible and efficient to use with the database management system, and the data to be analysed was already maintained in the database prior to the study.

3.3. Application Software

The programming environment for the application was Matlab. Matlab was suitable for the development of the application, and compatible with the SQL Server 2000 in which the data was maintained. The K-means algorithm used in the application was defined in Matlab as a function.

3.4. The Data Mining Process

The data exploration and presentation process consisted of several steps: data preparation, data selection and transformation, data mining, and presentation.

3.4.1. Data Preparation

In this step, the data that was maintained in different tables was joined in a single table. The students and students_log tables were joined using the StudentsID field as the key field. After the joining process, errors in the data were corrected.

Figure 1. Students and Students Grades Tables
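The join described in Section 3.4.1 might look roughly like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for SQL Server 2000. Only the table names `students` and `students_log` and the key field `StudentsID` come from the text; the placement of the other columns in each table, the name `students_joined`, and all rows are assumptions made for illustration.

```python
import sqlite3

def join_student_tables(conn):
    """Join students and students_log on StudentsID into one table and return its rows."""
    conn.execute(
        """
        CREATE TABLE students_joined AS
        SELECT s.StudentsID, s.SexCode, s.FacultyID, l.SuccessGrade
        FROM students AS s
        INNER JOIN students_log AS l ON l.StudentsID = s.StudentsID
        """
    )
    return conn.execute(
        "SELECT StudentsID, SexCode, FacultyID, SuccessGrade "
        "FROM students_joined ORDER BY StudentsID"
    ).fetchall()

# In-memory demo with made-up rows (hypothetical column layout)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (StudentsID INTEGER PRIMARY KEY, SexCode INTEGER, FacultyID INTEGER)"
)
conn.execute("CREATE TABLE students_log (StudentsID INTEGER, SuccessGrade REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [(1, 5, 2), (2, 10, 4), (3, 5, 1)])
conn.executemany("INSERT INTO students_log VALUES (?, ?)", [(1, 78.5), (2, 64.0)])

joined_rows = join_student_tables(conn)
```

An inner join drops students that have no matching row in `students_log` (student 3 in this demo). Whether the original study kept or discarded such records is not stated, so that choice is also an assumption.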

3.4.2. Data Selection and Transformation

After the data preparation, the data selection and transformation process was performed. In this step, the fields used in the study were determined and transformed if necessary. For example, fields with yes/no responses were transformed to 1/0. The AreaPointPercent, SuccessGrade, SexCode, HighSchoolTypeID and FacultyID fields were selected for use in the study, and a new table was created. AreaPointPercent was the percentile the student fell into in the university entrance exam, and SuccessGrade was the grade they obtained. The SexCode variable was coded 5 and 10 for male and female respectively. HighSchoolTypeID varied between 1 and 10, and FacultyID varied between 1 and 6.

3.4.3. Data Mining

The prepared data was then put through the data mining process, using the K-means algorithm. The number of clusters was supplied as an external parameter. Different cluster numbers were tried, and a successful partitioning was achieved with 5 clusters. The cluster centroids are given in Table 1.

Table 1. Cluster Centroids

Cluster  AreaPointPercent  SuccessGrade  Gender  HighSchoolTypeID  FacultyID
1        1.4774            89.5350       8.1070  6.4650            .5844
2        16.3113           59.047        6.9811  5.1981            3.0189
3        46.7851           77.140        6.9008  5.7190            3.149
4        80.1095           78.0146       6.6788  3.774             3.831
5        79.3565           44.8870       6.3043  3.1391            4.1304

3.4.4. Presentation

The results of the data mining step are presented in this step. For graphs and tables, the MapToolBox plug-in of Matlab was used. The resulting clusters are shown in Figure 2.

Figure 2. Each number represents a different cluster
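The clustering step in Section 3.4.3 can be illustrated with a small pure-Python sketch of the four K-means steps listed in Section 2.2. This is not the Matlab function used in the study. The function names, the seeded choice of initial centroids from the data points, the iteration cap, and the toy two-dimensional points are all assumptions made for this example.

```python
import math
import random

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iterations=100, seed=0):
    """Partition points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct objects as the initial centroids
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iterations):
        # Step 2: assign each object to the group with the closest centroid
        new_labels = [
            min(range(k), key=lambda c: euclidean_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments, and hence centroids, no longer change
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Toy data: two well-separated groups (illustrative values only)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

In the study each point would have five coordinates (AreaPointPercent, SuccessGrade, SexCode, HighSchoolTypeID, FacultyID) and k would be 5. Because K-means depends on the initial centroids, trying several seeds or cluster counts, as the authors did with the number of clusters, is a common way to check that a partitioning is stable.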

Figure 2 plots the university entrance exam percentiles on the x axis and the grades on the y axis. The graph shows that the 1st cluster is the most successful with regard to grades, while the 5th cluster is the least successful. The distribution of faculties in these two clusters is shown in Figures 3 and 4.

Figure 3. Distribution of Cluster 1

Figure 4. Distribution of Cluster 5

The majority of the students in cluster 1 are from the Faculty of Arts and Sciences. The reason is that most students in this faculty have high success grades and scholarships. They study hard to keep their scholarships, and therefore have good grades. Cluster 5, however, is mainly made up of Faculty of Communication and Faculty of Business Sciences students. These students have lower grades and lower results in the university entrance exam.

4. CONCLUSION

This study utilises data mining in the field of education. Cluster analysis and the K-means algorithm were used as data mining techniques. The steps of the data mining process were carried out and explained in detail. The area of application, education, differs from that of the usual data mining studies. The use of data mining techniques in education may provide more varied and significant findings, and may lead to an increase in the quality of education.

5. REFERENCES

[1] Vahaplar, A., İnceoğlu, M., Veri Madenciliği ve Elektronik Ticaret, Türkiye'de İnternet Konferansları, Harbiye İstanbul, 1-3 Kasım 2001.
[2] Erdoğan, Ş. Z., Veri Madenciliği ve Veri Madenciliğinde Kullanılan K-Means Algoritmasının Öğrenci Veri Tabanında Uygulanması, Yüksek Lisans Tezi, İstanbul Üniversitesi, 2004.
[3] Akpınar, H., Veri Tabanlarında Bilgi Keşfi ve Veri Madenciliği, İ.Ü. İşletme Fakültesi Dergisi, Sayı:1 (1-), Nisan 2000.
[4] Thearling, K., An Introduction to Data Mining, http://thearling.com/text/dmwhite/dmwhite.htm, 01 December 2003.
[5] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in data mining and knowledge discovery, MIT Press, USA, 1994.
[6] Han, J., Kamber, M., Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, USA, 5-10, 2001.
[7] Menteş, G. T., Faktör ve Kümeleme Analizi Yardımıyla Bankacılık Ürün ve Hizmetlerinin Araştırılması Üzerine Bir Uygulama, Doktora Tezi, İstanbul Üniversitesi, 2000.
[8] Yuqing, P., Xiangdan, H., Shang, L., The K-Means Clustering Algorithm Based On Density and Ant Colony, IEEE Int. Conf. Neural Networks & Signal Processing, Nanjing, China, 457-460, December 14-17, 2003.
[9] Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., Wu, A.Y., An Efficient k-means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.
[10] Luke, B. T., K-Means Clustering, http://fconyx.ncifcrf.gov/~lukeb/kmeans.html, 20 October 2004.

VITA

Şenol Zafer ERDOĞAN
He graduated from Computer Engineering at Trakya University in 2001. He received his MSc degree from Istanbul University in July 2004. He joined the Computer Engineering Department at Maltepe University in 2001. He is now a research assistant at Maltepe University.

Mehpare TİMOR
She graduated from Business Administration at Istanbul University in 1986. She received her MSc degree in 1988 and her PhD degree from Istanbul University in 1993. She is now an assistant professor at Istanbul University, Faculty of Business Administration.