Enhanced data mining analysis in higher educational system using rough set theory



Similar documents
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

not possible or was possible at a high cost for collecting the data.

Big Data with Rough Set Using Map- Reduce

SPATIAL DATA CLASSIFICATION AND DATA MINING

A Review of Data Mining Techniques

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

An Overview of Knowledge Discovery Database and Data mining Techniques

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Data Mining Solutions for the Business Environment

A Rough Set View on Bayes Theorem

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

How To Improve Cloud Computing With An Ontology System For An Optimal Decision Making

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

DATA MINING TECHNIQUES AND APPLICATIONS

ORGANIZATIONAL KNOWLEDGE MAPPING BASED ON LIBRARY INFORMATION SYSTEM

Data Warehousing and Data Mining in Business Applications

Introduction to Data Mining

Data Mining System, Functionalities and Applications: A Radical Review

Building Data Cubes and Mining Them. Jelena Jovanovic

Introduction. A. Bellaachia Page: 1

Future Trend Prediction of Indian IT Stock Market using Association Rule Mining of Transaction data

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Social Media Mining. Data Mining Essentials

International Journal of Advance Research in Computer Science and Management Studies

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Foundations of Business Intelligence: Databases and Information Management

ETL PROCESS IN DATA WAREHOUSE

Healthcare Measurement Analysis Using Data mining Techniques

Data Mining: A Preprocessing Engine

Customer Classification And Prediction Based On Data Mining Technique

Machine Learning: Overview

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Database Marketing, Business Intelligence and Knowledge Discovery

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Introduction to Data Mining

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Machine Learning and Data Mining. Fundamentals, robotics, recognition

ISSN: (Online) Volume 3, Issue 7, July 2015 International Journal of Advance Research in Computer Science and Management Studies

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Data Mining for Business Analytics

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

K-Means Clustering Tutorial

How To Use Neural Networks In Data Mining

AMIS 7640 Data Mining for Business Intelligence

Data Warehouse: Introduction

CSP Scheduling on basis of Priority of Specific Service using Cloud Broker

DATA VERIFICATION IN ETL PROCESSES

Foundations of Business Intelligence: Databases and Information Management

Standardization and Its Effects on K-Means Clustering Algorithm

An Empirical Study of Application of Data Mining Techniques in Library System

Identifying the Number of Visitors to improve Website Usability from Educational Institution Web Log Data

Data Mining Algorithms Part 1. Dejan Sarka

Big Data: Rethinking Text Visualization

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Course MIS. Foundations of Business Intelligence

Rule based Classification of BSE Stock Data with Data Mining

Data Mining Applications in Fund Raising

Information Management course

Chapter 23. Database Security. Security Issues. Database Security

COURSE RECOMMENDER SYSTEM IN E-LEARNING

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Inner Classification of Clusters for Online News

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Keywords Data Mining, Knowledge Discovery, Direct Marketing, Classification Techniques, Customer Relationship Management

ICSES Journal on Image Processing and Pattern Recognition (IJIPPR), Aug. 2015, Vol. 1, No. 1

Foundations of Business Intelligence: Databases and Information Management

Introduction to Data Mining Techniques

Data quality in Accounting Information Systems

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis

Chapter 20: Data Analysis

Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

Data Mining Analytics for Business Intelligence and Decision Support

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

Quality Control of National Genetic Evaluation Results Using Data-Mining Techniques; A Progress Report

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

Integrated Data Mining and Knowledge Discovery Techniques in ERP

Chapter 7: Data Mining

INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY DATA MINING IN HEALTHCARE SECTOR.

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Visualization methods for patent data

Distance Learning and Examining Systems

Business Intelligence Using Data Mining Techniques on Very Large Datasets

Transcription:

African Journal of Mathematics and Computer Science Research Vol. 2(9), pp. 184-188, October, 2009 Available online at http://www.academicjournals.org/ajmcsr ISSN 2006-9731 2009 Academic Journals Review Enhanced data mining analysis in higher educational system using rough set theory P. Ramasubramanian 1 *, K. Iyakutti 2 and P. Thangavelu 3 1 Department of CSE, Dr.G.U. Pope College of Engineering, Sawyerpuram, India. 2 Dept. of Microprocessor and Computer, Madurai Kamaraj University, Madurai, India. 3 Department of Mathematics, Aditanar College of Arts and Science, Tiruchendur, India. Accepted 6 August, 2009 One of the biggest challenges that higher education faces today is predicting the behavior of students. Institutions would like to know, something about the performances of the students group wise. Behrouz Minaei-Bidgoli (2004) investigated a web based educational system using data mining techniques. He proposed a problem to investigate the performances of the students when the large data base of Students information system (SIS) is given. Generally students problems will be classified into different patterns based on the level of students like normal, average and below average. In this paper we attempt to analyze SIS database using rough set theory to predict the future of students. Key words: Data mining, knowledge discovery, rough set, lower approximation and upper approximation, characteristic sets, missing attribute values. INTRODUCTION Rough set theory is a fundamental area that has made important contributions to the Knowledge discovery in data base (KDD). Data mining is the process of autonomously extracting useful information or knowledge from large data stores or sets. It involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. Data mining consists of more than collecting and managing data; it also includes analysis and prediction. These tools can include statistical models, mathematical algorithms and machine learning methods such as neural networks or decision trees etc by using rough set algorithms. Data mining (DM) can be performed on a variety of data stores, including the world wide web (WWW), relational databases, transactional databases, internal legacy systems, personal document format documents (PDF) and data warehouses. Data mining has been applied into many application domains such as Biomedical and DNA analysis, Retail industry and marketing, telecommunications, web mining, computer auditing, banking and insurance, fraud detec- *Corresponding author. E-mail: 2005.rams@gmail.com. Tel: +91 94883 54628. tion, financial industry, medicine and education. Data mining is popularly known as knowledge discovery databases. As the educational systems are capable of collecting large amount of students profile data, data mining and rough set techniques can be applied to find interesting relationships between attributes of students. Figure 1 shows the concept of data mining, which involves three steps: 1. Capturing and storing the data. 2. Converting the raw data into information. 3. Converting the information into knowledge. Data in this context comprises all the raw material an institution collects via normal operation. Capturing and storing the data is the first phase that is the process of applying mathematical and statistical formulas to mine the data warehouse. Mining the collected raw data from the entire institution may provide new information as to how students, parent s and the institutions own processes really perform. Converting the raw data into information is the second step of data mining. Our survey on the current works in data mining field shows that one of the application domains that can take advantage of data mining benefits in education. Our focus in this paper is data mining in higher educational

Ramasubramanian et al. 185 Figure 1. Data mining. system. Student Information System (SIS) data is involved with three kinds of large data sets: 1. Educational resources such as student databases, fees collection and individualized problems designed for use on assignments and examinations. 2. Information about users who create, modify, assess, or use these resources. 3. Activity log databases which log actions taken by students in behavioral characteristics and exam results. The objective of this paper is to predict the future of students who show the weak performance for the academic programmed, by accessing a group of students, likelihood of success, those who are graduating and not graduating and those who need additional training based on the features, which are extracted from their (and others) data. We design, implement and evaluate a series of pattern classifiers with various parameters in order to compare their performances in a real data set from the SIS system. To complete this task we require the knowledge of Rough set theory. Rough set logic The Rough set theory is a recent mathematical theory employed as a data mining tool with many favorable advantages. Since this theory has been applied to various domains, the majority of these applications are used to solve the classification problems, which exclude the temporal factor in data sets. The rough set analysis is presented as a technique to direct the knowledge discovery process from data. Pawlak (1982) introduced the Rough Set Theory which was initially developed for a finite universe of discourse in which the knowledge base is a partition, which is obtained by any equivalence relation defined on the universe of discourse. Let U be any finite universe of discourse. Let R be any equivalence relation defined on U. Clearly the equivalence relation partitions U. The pair (U, R) is called the approximation space. This collection of equivalence classes of R is called as knowledge base. Then for any subset A of U, the lower and upper approximations are defined respectively as follows: R A = RA = Wi { Wi : Wi A} { : A ϕ} Wi The ordered pair ( R A RA), is called rough set. In general, RA A RA. If R A = RA then A is called exact. The lower approximation of A is called the positive region of A and is denoted by POS(A) and the complement of upper approximation of A is called the negative region of A and is denoted by NEG(A). Its boundary is defined as BND(A) = RA RA. Hence, it is trivial that if BND(A) = ϕ, then A is exact. In the theory of rough sets, the decision table of any information system is given by T = (U, A, C, D), where U is the universe of discourse, A is a set of primitive features, C and D are the subsets of A, called condition and decision features respectively. For any subset P of A, a binary relation IND(P), called the indiscernibility relation is defined as IND(P) = {(x, y) U U : a( x) = a( y) for all a in P} The Rough set in A is the family of all subsets of U having the same lower and upper approximations. Generally Rough set theory is explored using the decision tables that are information tables. This table describes cases (also called objects or examples) using attribute values and a decision. Attribute values are independent denoted by a, while decision values are dependent denoted by d. Rows of the table are labeled as cases denoted by U. The set of all attributes is denoted by A. The following notations are used in this paper: U: Set of all cases. A: Set of all Attributes. V: Set of all attribute values. D: Decision values. B: Non-empty subset of A. Here P is a function defined from U A to V. Rough set analysis on SIS data Consider a completely specified student data as shown Table 1. In rough set theory it is called an information table. The table consists of three attributes of a student namely academic (A), non-academic (NA) and human behavior relationships (HBR). The value of the academic attribute is calculated on the basis of student performances in theory, practical, attendance, participating seminars, paper presentations, interactions, reading books and participating department activities. The value of the non-academic attribute is calculated on the basis of student performances in Sports, NSS, NCC, YRC and Social activities. The value of the Human behavior rela-

186 Afr. J. Math. Comput. Sci. Res. Table 1. SIS data set. S.No Academic Non-academic Human behavior relation Decision 1 0.8 0.6 0.8 Normal 2 0.7 0.4 0.9 Normal 3 0.5 0.4 0.4 Average 4 0.4 0.3 0.7 Below Average 5 0.3 0.4 0.9 Below Average 6 0.4 0.5 0.5 Below Average 7 0.8 0.6 0.8 Normal 8 0.8 0.6 0.8 Normal 9 0.8 0.3 0.8 Normal 10 0.3 0.2 0.4 Below Average 11 0.7 0.7 0.7 Normal Table 2. Characteristic Relation. a K x(a) K y(a) K z(a) K B(a) 1 {1,4,5,7,8,9} {1,3,9,10} {1,7,8,9,11} {1,9} 2 { 2,4,5,11} {1,2,3,4,5,9,10} {2,5,11} {2,5} 3 {3,4,5} {1,3,9,10} {3,11} {3} 4 {4,5} {1,2,3,4,5,9,10} {4,11} {4} 5 {4,5} {1,2,3,4,5,9,10} {2,5,11} {5} 6 {4,5,6} {1,3,6,9,10} {6,10,11} {6} 7 {1,4,5,7,8,9} {1,3,7,8,9,10} {1,7,8,9,11} {1,7,8,9} 8 {1,4,5,7,8,9} {1,3,7,8,9,10} {1,7,8,9,11} {1,7,8,9} 9 {1,4,5,7,8,9} {1,3,9,10} {1,7,8,9,11} {1,7,8,9} 10 {4,5,10} {1,3,9,10} {6,10,11} {10} 11 { 2,4,5,11} {1,3,9,10, 11} {11} {11} tionships is calculated on the basis of students relationship with their teachers, fellow students, public, family members and their performances in hobbies and entertainment. Now, we wish to analyze the following hypothetical sample information table using rough set theory. Learning from examples is the most important tool employed in data mining. We compute lower and upper approximations of all decisions in SIS data. By using boundary of the decisions (BND), we extract knowledge of the SIS database. SIS data for data mining are frequently affected by missing attributes values of Academic, Non-academic and Human behaviour relationships. Consider the following incomplete SIS decision table as shown in Table 3, we will assume that all unknown attributes values are denoted by *. Some attributes will be lost. Such values will be denoted by?. Computing characteristic relation The characteristic relation R(B) is known if we know characteristic sets K B (a) for all a U. Here; U = { 1,2,3,4,5,6,7,8,9,10,11} B = {x, y, z}. For each a U, K x (a) = { b U : p(a, x) = p(b, x) } {b U : p(b, x) = * }. K B (a) = K x (a)k y (a) K z (a). Using this formula we form the Table 2. Let (U, A) be an information system and let B A and X U. We can approximate X using the information contained in B by constructing the B-lower and B-upper approximations of X denoted by BX and BX respectively, where; BX = BX = { K B ( a) : KB( a) X} { KB ( a) : KB( a) X φ} The set BND(X) = BX BX is called the B-boundary re-

Ramasubramanian et al. 187 Table 3. SIS data set with missing attribute values. S. No Academic Non-academic Human behavior relation Result 1 0.8 * 0.8 Normal 2 0.7 0.4 0.9 Average 3? * 0.4 Average 4 * 0.4 0.7 Average 5 * 0.4 0.9 Average 6 0.4 0.5? Below Average 7 0.8 0.6 0.8 Normal 8 0.8 0.6 0.8 Normal 9 0.8 * 0.8 Normal 10 0.3 *? Below Average 11 0.7 0.7 * Average region of X, and this consists of those objects that we cannot decisively classify into X on the basis of knowledge in B. Example 1 Consider the two sets (decisions) X = {2, 3, 4, 5, 11} and Y = {1, 6, 7, 8, 9, 10}. From the above table, the B-lower and B-upper approximations of the two concepts (decisions) are: B X = {2, 3, 4, 5, 11} B X = {2, 3, 4, 5, 11} B Y = {1, 6, 7, 8, 9, 10} B Y = {1, 6, 7, 8, 9, 10} Boundary region of X is = {2,3,4,5,11} {2,3,4,5,11} = φ Boundary region of Y is ={1,6,7,8,9,10} {1,6,7,8,9,10}= φ Thus we conclude that, the set X and set Y have the same characteristics. Example 2 Consider the two sets (decisions) X= {1, 4, 5, 6, 8, 9} and Y = {2, 3, 7, 10, 11}. From the above table, the B-lower and B-upper approximations of the two concepts (decisions) are: B X = {1, 4, 5, 6, 9} B X = {1, 2, 4, 5, 6, 7, 8, 9} B Y = {3} B Y = {1, 2, 3, 5, 7, 8, 9, 10, 11} Boundary region of X is = {1, 2, 4, 5, 6, 7, 8, 9} {1, 4, 5, 6, 9} = {2, 7, 8} Boundary region of Y is = {1, 2, 5, 7, 8, 9, 10, 11} Hence, we conclude that, any characteristic of X is also the characteristics of Y. Example 3 Consider the two sets (decisions) X = {2, 4, 5, 6, 11} and Y= {1, 9, 11}. From the above table, the B-lower and B- upper approximations of the two concepts (decisions) are: B X = {2, 4, 5, 6, 11} B X = {2, 4, 5, 6, 11} B Y = {1, 9, 11} B Y = {1, 7, 8, 9, 11} Boundary region of x is = {2, 4, 5, 6, 11} {2, 4, 5, 6, 11} = φ Boundary region of y is = {1, 7, 8, 9, 11} {1, 9, 11} = {7, 8} From this, we conclude that, the characteristics of X and Y differ significantly. Algorithm Input: B, the subset of attributes Input: X, the subset of U, the decision space Output: Boundary of X 1. [Initialize] G: =U a U Lower(X) = Upper(X) = 2. Repeat step 3 to 8 while G. 3. If K B (a) X then Lower(X) = Lower(X) K B (a).

188 Afr. J. Math. Comput. Sci. Res. 4. If K B (a) X φ then Upper(X) = Upper(X) K B (a). 5. G = G {a} [End of the loop] 6. BND(X) = Upper(X) Lower(X). Data mining in higher educational system Our main contribution in this paper is addressing the capabilities and strengths of data mining technology in the context of higher educational system. Higher educational system can enhance their educational processes according to our proposed analysis model. We will present and describe an analysis model for this system. We analyze Student information system (SIS) database with the attributes such as academic, non-academic and human behavior relationships. These attributes can be further divided into corresponding sub attributes whose values will be analyzed by rough set logic. We will use these techniques to mine our SIS. Data are generally organized as matrix. The matrix contains a set of instances. Each matrix column represents the values associated to an attribute. Attributes can be for the student database for example: register number, name, branch of study, year, percentage of marks, academic, non-academic and human behavior relationships etc. An attribute can be either numerical or nominal. A nominal attribute contains values within a limited set of choices. A special attribute called decision attribute contains information about the decision to take upon the individual s consideration. The database consists of student s data collected from various departments. The advantage of this data is that it includes sufficient number of records of different types of students. A database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships. We use SQL data base. This is the most commonly used query language for relational database in SQL, which allows retrieval and manipulation of the data stored in the tables as well as the calculation of aggregate functions such as average, sum, min, max and count. Data mining can be benefited from SQL for data selection, transformation and consolidation. SQL provides predicting, correcting, comparing and detecting deviations etc. The various attributes are given in tabular column: Conclusion Thus we have analyzed the SIS database, using Rough set theory and discussed an algorithm to implement the technique. Using the technique discussed in this paper, one can access the performances of a particular student by accessing a group of students even if we are not aware of certain attribute values of the students. REFERENCES Behrouz Minaei-Bidgoli (2004). Data Mining For a Web Based Educational System, Ph.D Thesis Report. Department of Computer Science and Engineering, Michigan State University. Pawlak Z (1982). Rough Sets. Int. J. Comput.Info. Sci: 341 356.