Predictive Analytics on Student Academics




[1] Mr. K. Balaprasath, [2] Mr. L. Arun Raj
[1] M.Tech-CSE, [2] Assistant Professor, B.S. Abdur Rahman University, Chennai.

Abstract
The academic performance of university students is a topic of interest in the education community. Student performance plays a significant role in course discontinuation. A large set of academic data is used to predict, year by year, whether students will fulfil the degree requirements. Two data mining algorithms, K-Means clustering and Apriori, are applied in combination with Linear Regression. The proposed system predicts the learning performance of learners based on both academic and non-academic records. Data collected from the students via Google Forms are analysed using the mining algorithms, and the results are displayed using a visualization tool. Based on the analysis, the academic performance of a student can be evaluated, so that steps can be initiated to enhance the teaching-learning process.

Keywords: Educational Data Mining (EDM), K-Means Clustering, Apriori, Linear Regression.

I. INTRODUCTION
Predictive analytics is the branch of data mining concerned with predicting future probabilities and trends. In a learning context, it helps to predict student behaviour in the areas of learning outcomes, recruitment, and retention. By analysing historical and current student data, predictive analytics can inform an institution about which improvements to make for its students. It is important for students to complete the degree in a timely fashion. In higher education, predictive analytics uses past and current student data to predict future student performance. This is important because it can help students stay on track to graduate and alert them when they are falling behind.
When predictive analytics is applied in EDM, its methods can be used to study which features of a model are important for prediction, giving information about the underlying construct. This is a common approach in programs of research that attempt to predict student educational outcomes without first predicting intermediate or mediating factors.

Educational Data Mining (EDM) is the process of discovering useful information from the raw data collected, which can then be used by the various stakeholders. EDM is concerned with developing methods for discovering knowledge from data that come from educational environments. EDM applies distinctive techniques and approaches, and can develop new ones, for exploring knowledge that originates from educational information systems.

II. RELATED WORKS
A classification model is built using a training set; the method assigns a predefined label or class to a record based on a set of known attributes [1]. Latent Semantic Analysis identifies the hidden meaning of textual information in documents [2]. Learning style and engagement are major factors in learning analytics; it is essential to check the learning style of each student and to provide learning materials suited to that style, which motivates the knowledge-gathering process [3]. E-learning systems in education provide a lot of data about courses. Analysis of these data can help educators make adjustments and improve teaching efficiency. A visual analytics system lets users find salient units, observe details, and make decisions using context information [4]. Taking the students' demographic information and their grades in each semester as prerequisites for modelling with Bayesian Networks, the factors affecting the students' capacity in each semester can be accurately identified [5].
Learning analytics is a valuable source for understanding student behaviour and giving feedback, and it could be a powerful source of data for all forms of assessment. Analysing the feedback comments given by the students after each lesson helps to understand the prediction errors that occur in each lesson and to further improve student grade prediction [6].

ISSN: 2231-2803 http://www.ijcttjournal.org Page 93

III. METHODOLOGY

A. Data Collection
Data collection deals with gathering student information. The non-academic data are collected from the students via questionnaires. Multiple questions are prepared using Google Forms to obtain the details of each student, as shown in Fig. 1. The academic data are collected from the faculty rather than from the students, so that they are reliable. The collected data are converted to numerical values, because most data mining algorithms require data sets in numerical format, as shown in Fig. 2. The numerical values are assigned based on the choices offered to the students in the Google Forms.

Fig. 1 Data collected from students via Google Forms.
Fig. 2 Converted data sets.

B. Data Pre-processing
The information provided by the students may be inaccurate or imprecise, so data pre-processing is an important step in the prediction process. The collected data may contain null values, and those values need to be removed before operations such as grouping and plotting are performed. After pre-processing, the data can be integrated with visualisation tools and the data mining algorithms can be applied for further processing.

C. Data Conversion
StreamReader is used for reading text files. It is found in the System.IO namespace.

    using System.Collections.Generic;
    using System.IO;

    // Open the CSV exported from Google Forms.
    StreamReader reader = new StreamReader("..//..//STUDENTS.csv");
    List<string> row = new List<string>();
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();
        // Replace each board name with its numeric code
        // (the literals must match the casing used in the CSV).
        line = line.Replace("CBSE", "2");
        line = line.Replace("STATE BOARD", "1");
        line = line.Replace("OTHERS", "3");
    }
    ...

Using this code, the .csv file containing the student data collected via Google Forms is loaded into the .NET framework, and the non-numeric values are converted into numeric values.

D. System Architecture
The data collected from the students and advisors are pre-processed and then integrated into the R-Tool environment to perform the operations, as shown in Fig. 3. Depending on the algorithm chosen, the R IDE may not accept non-numerical values; mismatched input results in an error. To avoid this, the .NET framework is used: StreamReader reads files in formats such as Excel and CSV, and the program replaces the values with the preferred numeric codes. The output file is in the numerical format that the R environment needs; the file is then loaded and the required operations are performed. The next task is to perform regression to identify the relationship between one dependent attribute and one or more independent attributes. After the process is complete, the data sets are analysed to predict the students' academic performance and to identify the reasons why some students have not performed well. Finally, the report is generated and handed to the administrators responsible for taking action on behalf of those students.
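The pre-processing and conversion steps described in Sections B and C can be illustrated with a minimal sketch. It is written in Python rather than the C# and R tools the paper uses, so it is an illustration and not the authors' implementation. The column names (`name`, `board`, `cgpa`), the sample rows, and the helper `convert_rows` are assumptions made for this example; the numeric codes follow one reading of the snippet in Section C.

```python
import csv
import io

# Hypothetical mapping from questionnaire choices to numeric codes.
# The codes (STATE BOARD = 1, CBSE = 2, OTHERS = 3) follow the snippet in
# Section C; the column name "board" is an assumption for illustration.
BOARD_CODES = {"STATE BOARD": "1", "CBSE": "2", "OTHERS": "3"}


def convert_rows(csv_text):
    """Drop rows with empty fields, then replace board names with codes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        # Pre-processing: skip records that contain null (empty) values.
        if any(value is None or value.strip() == "" for value in row.values()):
            continue
        # Conversion: map the text choice to its numeric code.
        key = row["board"].strip().upper()
        row["board"] = BOARD_CODES.get(key, row["board"])
        cleaned.append(row)
    return cleaned


# Made-up sample data; the third record has a missing board and is dropped.
sample = "name,board,cgpa\nA,CBSE,8.1\nB,State Board,7.4\nC,,6.9\nD,Others,7.0\n"
for record in convert_rows(sample):
    print(record)
```

Normalising the text with `upper()` before the lookup avoids the casing mismatches that a plain string replacement is sensitive to.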

Fig. 3 System Architecture of the students academics.
Fig. 4 Plotting the attribute set of 10th standard data with CGPA comparison.
Fig. 5 Plotting the attribute set of 12th standard data with CGPA comparison.

The non-academic data collected from the students are compared with the academic data of the same students while keeping one criterion constant, i.e. whether the students selected the course based on their own interest or under someone's influence. Based on this criterion, the 10th and 12th standard marks are compared with the CGPA and plotted using R-Tool, as shown in Fig. 4 and Fig. 5.

IV. EXPERIMENTAL ANALYSIS AND RESULTS
K-Means aims to partition n observations into k clusters in which every observation belongs to the cluster with the closest mean. It is an algorithm for classifying or grouping objects, based on their attributes, into K groups. The converted data sets are loaded into R-Tool to perform the K-Means operations. The steps to be followed are:

Step 1: Choose any attribute from the student data set.
Step 2: Create an object for the data set.
Step 3: Set the attribute to NULL with the help of the object.
Step 4: Set the cluster size, pass it as a parameter to perform the K-Means operation, and store the result in a variable.
Step 5: Perform K-Means operations for the available components such as cluster, size, centre, etc.
Step 6: Store the obtained results in a table.
Step 7: Choose another attribute to compare the results in order to visualize them.
Step 8: Set that attribute to NULL.
Step 9: Repeat steps 3, 4 and 5.
Step 10: Plot the results by comparing the attributes on which the operations were performed.

Fig. 6 Cluster formation based on the marks secured along with the students' interest.

Three clusters are formed for the student data sets, as shown in Fig. 6. The students who performed well in both the 10th and 12th standards and also scored well in the semester exams are in cluster 1, and so on.

Apriori is a classic algorithm for learning association rules.
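The K-Means procedure described earlier in this section can be sketched in a few lines. The version below is a plain Python illustration of the standard assign-and-update loop (Lloyd's algorithm), not the R-Tool call the paper relies on. The `(10th standard mark, CGPA)` pairs are made up, the function name `kmeans` is hypothetical, and the initial centres are simply the first k points, which is a simplifying assumption.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: return (centres, labels) for a list of numeric tuples."""
    centres = [points[i] for i in range(k)]  # naive initialisation: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest centre.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centres, labels


# Hypothetical (10th standard mark, CGPA) pairs, for illustration only.
students = [(92, 9.1), (60, 6.0), (75, 7.4), (90, 8.8), (62, 6.3),
            (77, 7.6), (88, 8.9), (58, 5.9), (74, 7.2)]
centres, labels = kmeans(students, k=3)
print(labels)
print(centres)
```

Because the marks are on a much larger scale than the CGPA, they dominate the squared distance here; scaling the attributes first would be a reasonable refinement.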
Association rule learning, the task that Apriori addresses, is a method for discovering relations between variables in large databases. The algorithm identifies the frequent individual items in the database and extends them to larger item sets as long as those item sets appear sufficiently often in the database. The data sets collected from the students are chosen, and the CGPA is categorised as high, medium or low in order to match patterns, as shown in Fig. 5.

Fig.5 Data sets categorisation to perform Apriori algorithm.

The steps needed to perform the Apriori algorithm in R-Tool are:

Step 1: Load the required data sets.
Step 2: Apply the association rules to the selected data sets and store the result in a variable.
Step 3: Pass the parameters: the data sets, the rhs (right-hand side) and lhs (left-hand side) to be categorised, a minimum length, and the support and confidence values for the attributes to be calculated.
Step 4: Sort the association rules.
Step 5: Identify redundant rules and remove them, since they are no longer required.
Step 6: Store the obtained results in a table.
Step 7: Plot the rules.

The association rules applied to the student data sets are shown in Fig. 6. The x and y coordinates are support and confidence, with lift represented at each point.

Fig. 6 Scatter plot using the Apriori algorithm.
Fig. 7 Identifying the outcome of the students.

The outcomes of the students, together with their reason for course selection, their background and the marks secured, are taken and association rules are applied, as shown in Fig. 7. The shaded portion indicates that students with those combinations scored more than the others.

Linear regression is the simplest and most commonly used method in predictive analytics. Regression is used to describe data and the relationship between one dependent variable and one or more independent variables. To perform linear regression in R-Tool, the converted data sets must be used instead of the raw data collected from the students.
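As a rough illustration of the regression step, the sketch below fits a straight line with one independent variable by least squares in plain Python. It is not the R-Tool workflow used in the paper. The marks and CGPA values are made up, and the helper names are hypothetical. The slope is computed from the covariance and variance, and the correlation coefficient is computed alongside it, since both measures are discussed next.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation coefficient, always between -1 and +1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    slope = covariance(xs, ys) / covariance(xs, xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


# Hypothetical data: 12th standard marks (independent) and CGPA (dependent).
marks = [60, 70, 80, 90]
cgpa = [6.2, 6.9, 8.1, 8.8]

slope, intercept = fit_line(marks, cgpa)
residuals = [y - (intercept + slope * x) for x, y in zip(marks, cgpa)]
print("slope:", slope, "intercept:", intercept)
print("correlation:", correlation(marks, cgpa))
print("residuals:", residuals)
```

The residuals computed at the end are the quantities that a residual plot and a normal Q-Q plot would display.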
The local memory needs to be cleared before performing linear regression in R-Tool; the rm(data set name) command is used for this. The correlation coefficient is then measured between two variables of the chosen data sets. Values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense; a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense; and a correlation coefficient of zero indicates that there is no linear relationship between the two variables.

After correlation is measured, covariance is measured for the same attributes. Covariance is a measure of how much two random variables change together. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e. the variables tend to show similar behaviour, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e. the variables tend to show opposite behaviour, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

Fig. 8(a) Residual plot
Fig. 8(b) Normal Q-Q plot

The residual plot, shown in Fig. 8(a), is used to detect non-linearity, unequal error variances, and outliers. If the residuals are spread equally around a horizontal line without distinct patterns, the chosen attributes do not have a non-linear relationship. If the plotted points rise and fall like a curve, the relationship is non-linear. The normal Q-Q plot, shown in Fig. 8(b), shows whether the residuals are normally distributed: if the residuals follow a straight line they are normally distributed, and if they deviate from the line they are not.

V. CONCLUSION
Research has shown that the academic performance of a student is affected by both academic and non-academic factors. The system was designed and developed to collect the principal academic and non-academic details from university-level students using automated forms. The collected data are then cleaned and pre-processed, and standard mining algorithms are applied. The algorithms considered are K-Means clustering and the Apriori algorithm. The mined results are displayed in both text and graphical form. Using these results, the education system, comprising teachers, management and students, can take measures to enhance the teaching-learning process. As a result, educational performance can be improved. This work can be further enhanced to predict the learning patterns of students and to identify exactly which factors affect the students' academic performance.

REFERENCES
[1] Camilo Ernesto Lopez Guarín, Elizabeth León Guzman and Fabio A. Gonzalez, "A Model to Predict Low Academic Performance at a Specific Enrollment Using Data Mining", IEEE Journal of Latin-American Learning Technologies, vol. 10, no. 3, pp. 119-125, 2015.
[2] Shaymaa E. Sorour, Tsunenori Mine, Kazumasa Goda and Sachio Hirokawa, "Comments Data Mining for Evaluating Students Performance", International Conference on Advanced Applied Informatics, pp. 25-3, 2014.
[3] Nidyanandan Pratheesh and Devi Thirupathi, "Sensation of Learning Analytics to Prevail the Software Engineering Education", International Conference on Advanced Computing and Communication Systems, pp. 1-7, 2013.
[4] Xin Li, Xuehui Zhang and Xin Liu, "A Visual Analytics Approach for E-learning Education", International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, pp. 34-40, 2015.
[5] Ashkan Sharabiani, Fazle Karim, Anooshiravan Sharabiani, Mariya Atanasov and Houshang Darabi, "An Enhanced Bayesian Network Model for Prediction of Students Academic Performance in Engineering Programs", Global Engineering Education Conference, pp. 832-837, 2014.
[6] Shaymaa E. Sorour, Jingyi Lu, Kazumasa Goda and Tsunenori Mine, "Correlation of Grade Prediction Performance with Characteristics of Lesson Subject", International Conference on Advanced Learning Technologies, pp. 247-49, 2015.