Predictive Analytics on Student Academics

[1] Mr. K. Balaprasath, [2] Mr. L. Arun Raj
[1] M.Tech-CSE, [2] Assistant Professor, B.S. Abdur Rahman University, Chennai.

Abstract

Academic performance among university students is a topic of interest in the educational community. Student performance plays a significant role in course discontinuation. A large set of academic data is used to predict, year by year, whether students will fulfil their degree requirements. Two data mining algorithms, K-Means clustering and Apriori, are applied in combination with Linear Regression. The proposed system predicts the learning performance of learners based on both academic and non-academic records. Data collected from the students via Google Forms are analyzed using the mining algorithms, and the results are displayed using a visualization tool. Based on the analysis, the academic performance of a student can be evaluated, so that steps can be initiated to enhance the teaching-learning process.

Keywords: Educational Data Mining (EDM), K-Means Clustering, Apriori, Linear Regression.

I. INTRODUCTION

Predictive analytics is the branch of data mining concerned with predicting future probabilities and trends. In a learning context, it helps to predict student behaviour in the areas of learning outcomes, recruitment, and retention. By analyzing historical and current student data, predictive analytics can inform an institution about which improvements to make from the students' perspective. It is important for students to complete their degree in a timely fashion. In higher education, predictive analytics uses past and current student data to predict future student performance. This is important because it can help students stay on track to graduate and alert them when they are falling behind.
When predictive analytics is applied in EDM, its methods can be used to study which features of a model are important for prediction, giving information about the underlying construct. This is a common approach in programs of research that attempt to predict student educational outcomes without first predicting intermediate or mediating factors.

Educational Data Mining (EDM) is the process of discovering useful information from the unprocessed data collected, which can then be used by the various stakeholders. EDM is concerned with developing methods for discovering knowledge from data that come from educational environments. It applies its own techniques and approaches to explore the knowledge originating from educational information systems.

II. RELATED WORKS

A classification model is built using a training set; the method assigns a predefined label or class to a record based on a set of known attributes [1]. Latent Semantic Analysis identifies the hidden meaning of textual information in documents [2]. Learning style and engagement are major factors in learning analytics; it is essential to identify the learning style of each student and to provide learning materials suited to that style, which motivates the knowledge-gathering process [3]. E-learning systems in education provide a large amount of data about courses. Analysis based on these data can help educators make adjustments and improve teaching efficiency. A visual analytics system helps users find salient units, observe details, and make decisions using context information [4]. Students' demographic information and their grades in each semester have been taken as prerequisites for modeling with Bayesian Networks, so that the factors affecting a student's capacity in each semester can be accurately identified [5].
Learning analytics is a valuable source for understanding student behaviour and giving feedback, and it could be a powerful source of data for all forms of assessment. Analyzing the feedback comments given by students after each lesson helps to understand the prediction errors that occur in each lesson and to further improve student grade prediction [6].

III. METHODOLOGY

ISSN: Page 93

A. Data Collection

Data collection deals with gathering student information. The non-academic data are collected from the students via questionnaires. Multiple questions are prepared using Google Forms in order to obtain the details of each student, as shown in Fig. 1. The academic data of the students are collected from the faculty rather than from the students, so that they are reliable. The collected data are converted to numerical values, as shown in Fig. 2, because most data mining algorithms require the data sets to be in numerical format. The numerical values are assigned based on the choices offered to the students in the Google Forms.

Fig. 1 Data collected from students via Google Forms.

Fig. 2 Converted data sets.

B. Data Pre-processing

The information provided by the students may not be accurate or precise, so data pre-processing is an important step in the prediction process. The collected data may contain null values, and those values need to be removed before operations such as grouping and plotting are performed. After pre-processing, the data can be integrated with visualisation tools and the data mining algorithms can be applied for further processing.

C. Data Conversion

StreamReader is used for reading text files. It is found in the System.IO namespace.

    using System.Collections.Generic;  // List<T>
    using System.IO;                   // StreamReader

    // Open the CSV file exported from Google Forms.
    StreamReader reader = new StreamReader("..//..//STUDENTS.csv");
    List<string> row = new List<string>();
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();
        // Replace each categorical answer with its numeric code.
        line = line.Replace("cbse", "2");
        line = line.Replace("state BOARD", "1");
        line = line.Replace("others", "3");
    }
    ...

D. System Architecture

The data collected from the students and advisors are pre-processed and then integrated into the R-Tool environment in order to perform the operations, as shown in Fig. 3. Depending on the algorithm chosen, the R IDE requires numerical values for processing; it does not accept mismatched input and returns an error.
To avoid that, the .NET framework is used: StreamReader reads file formats such as Excel and CSV and replaces the values with those preferred in the program. The output is a file in the numerical format that the R environment needs; the file is then loaded and the required operations are performed. The next task is to perform regression to identify the relationship between one dependent attribute and one or more independent attributes. After the process is complete, the data sets are analyzed to predict the students' academic performance and to identify the reasons why some students have not performed well. Finally, the report is generated and handed to the administrators responsible for taking action on behalf of those students.

Using the StreamReader code above, the .csv file containing the student data collected via Google Forms is loaded into the .NET framework and the non-numeric values are converted into numeric values.
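The pre-processing and conversion steps can also be illustrated with a short Python sketch. Python is used here only for brevity; the paper itself relies on C# and R-Tool. The column name `board`, the code table `BOARD_CODES`, the function `convert_rows`, and the sample rows are all hypothetical and not taken from the paper's data set.

```python
import csv
import io

# Hypothetical code table: categorical answer -> numeric code.
BOARD_CODES = {"state board": "1", "cbse": "2", "others": "3"}

# Hypothetical sample standing in for the exported Google Forms CSV.
SAMPLE = """name,board,cgpa
A,CBSE,8.1
B,State Board,7.4
C,,6.9
D,Others,7.9
"""


def convert_rows(csv_text, column="board", codes=BOARD_CODES):
    """Drop rows with empty fields, then encode one categorical column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    converted = []
    for row in reader:
        # Pre-processing: skip records with null (empty or missing) values.
        if any(not isinstance(v, str) or not v.strip() for v in row.values()):
            continue
        key = row[column].strip().lower()
        if key not in codes:
            continue  # unknown category: leave it out rather than guess
        row[column] = codes[key]
        converted.append(row)
    return converted


rows = convert_rows(SAMPLE)
for record in rows:
    print(record)
```

Matching on a lower-cased key is a design choice of this sketch: it avoids the case sensitivity of plain string replacement, where an answer written with different capitalisation would be left unconverted.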

Fig. 3 System Architecture of the students' academics.

Fig. 4 Plotting the attribute set of 10th standard data with CGPA comparison.

The non-academic data collected from the students are compared with the academic data of the same students while holding the reason for course selection constant, i.e., whether the students selected the course based on their own interest or under someone's influence. On this basis, the 10th and 12th standard marks are compared with the CGPA and plotted using R-Tool, as shown in Fig. 4 and Fig. 5.

Fig. 5 Plotting the attribute set of 12th standard data with CGPA comparison.

IV. EXPERIMENTAL ANALYSIS AND RESULTS

K-Means aims to partition n observations into k clusters in which every observation belongs to the cluster with the closest mean. It is an algorithm that groups objects into K groups based on their attributes. The converted data sets are loaded into R-Tool to perform the K-Means operations. The steps to be followed are:

Step 1: Choose any attribute from the student data set.
Step 2: Create an object for the data set.
Step 3: Set the attribute to NULL using the object.
Step 4: Set the cluster size, pass it as a parameter to the K-Means operation, and store the result in a variable.
Step 5: Examine the available components of the K-Means result, such as cluster, size, and centre.
Step 6: Store the obtained results in a table.
Step 7: Choose another attribute to compare the results for visualization.
Step 8: Set that attribute to NULL.
Step 9: Repeat steps 3, 4 and 5.
Step 10: Plot the results by comparing the attributes on which the operations were performed.

Fig. 6 Cluster formation based on the marks secured along with their passion.

Three clusters are formed for the student data sets, as shown in Fig. 6. The students who performed well in both the 10th and 12th standards and also scored well in the semester exams are in cluster 1, and so on.

Apriori is a classic algorithm for learning association rules.
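Before turning to association rules in detail, the K-Means procedure described above can be illustrated with a minimal Python sketch of the standard assign-and-update loop (Lloyd's algorithm). This is not the R-Tool code used in the paper. The `students` list of (12th standard mark, CGPA) pairs is invented for illustration, the function name `kmeans` is hypothetical, and seeding the centres with the first k points is a simplifying assumption made so that the result is deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; the first k points seed the centres."""
    centres = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centres are stable
        assignment = new_assignment
        # Update step: recompute each centre as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centres


# Hypothetical (12th standard mark, CGPA) pairs.
students = [(92, 9.1), (60, 6.2), (75, 7.5), (90, 8.8), (62, 6.0), (78, 7.8)]
assignment, centres = kmeans(students, 3)
print(assignment)
print(centres)
```

Because the marks span a much wider numeric range than the CGPA, they dominate the Euclidean distance in this sketch; scaling the attributes first would be a reasonable refinement.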
Association rule learning is a method for discovering relations between variables in large databases. The algorithm identifies the frequent individual items in the database and extends them to larger item sets as long as those item sets appear sufficiently often in the database. From the data sets collected from the students, the CGPA is categorized as high, medium, or low in order to match the patterns, as shown in Fig. 5.

Fig. 5 Data sets categorisation to perform Apriori algorithm.

The steps needed to perform the Apriori algorithm in R-Tool are:

Step 1: Load the required data sets.
Step 2: Apply the association rules to the selected data sets and store the result in a variable.
Step 3: Pass the parameters: the data sets, the rhs (right-hand side) and lhs (left-hand side) to be categorized, a minimum length, and the support and confidence values for the attribute to be evaluated.
Step 4: Sort the resulting association rules.
Step 5: Identify the redundant rules and remove them, since they are no longer required.
Step 6: Store the obtained results in a table.
Step 7: Plot the rules.

The association rules applied to the student data sets are shown in Fig. 6. The x and y coordinates represent support and confidence, and lift is shown as a third measure.

Fig. 6 Scatter plot using the Apriori algorithm.

Fig. 7 Identifying the outcome of the students.

The outcome of the students, together with their reason for course selection, their background, and the marks secured, is taken and the association rules are applied, as shown in Fig. 7. The shaded portion indicates that students with those combinations scored higher than the others.

Linear regression is the simplest and most commonly used method in predictive analytics. Regression is used to describe data and the relationship between one dependent variable and one or more independent variables. To perform linear regression in R-Tool, the converted data sets must be chosen instead of the raw data collected from the students.
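To make the regression idea concrete, the following Python sketch fits one dependent variable (CGPA) on one independent variable (12th standard mark) by ordinary least squares and also computes the Pearson correlation coefficient, which always lies between -1 and +1. It is not the R-Tool code used in the paper. The `marks` and `cgpa` values are invented for illustration, and the function names `fit_line`, `pearson_r`, and `residuals` are hypothetical.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(x, y, intercept, slope):
    """Observed value minus fitted value for each observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data: 12th standard marks and the corresponding CGPA.
marks = [60, 70, 80, 90]
cgpa = [6.1, 6.9, 8.1, 8.9]

intercept, slope = fit_line(marks, cgpa)
r = pearson_r(marks, cgpa)
res = residuals(marks, cgpa, intercept, slope)
print(intercept, slope, r)
print(res)
```

The sign of the slope always matches the sign of the correlation coefficient, since both share the same numerator (the sum of cross-deviations).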
The local memory needs to be cleared before performing linear regression in R-Tool; the rm(data set name) command is used for that. Then the correlation coefficient between two variables is measured for the chosen data sets. Values of the correlation coefficient are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense; a correlation coefficient of -1 indicates that two variables are perfectly related in a negative linear sense; and a correlation coefficient of zero indicates that there is no linear relationship between the two variables. After the correlation is measured, it is essential to measure the covariance for the same attributes. Covariance is a measure of how much two random variables change together. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, i.e., the variables tend to show similar behaviour, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, i.e., the variables tend to show opposite behaviour, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

Fig. 8(a) Residual plot

Fig. 8(b) Normal Q-Q plot

The residual plot, shown in Fig. 8(a), is used to detect non-linearity, unequal error variances, and outliers. If the residuals are spread evenly around a horizontal line without distinct patterns, the chosen attributes do not have a non-linear relationship. If the residuals instead rise and fall along a curve, the relationship is non-linear. The normal Q-Q plot, shown in Fig. 8(b), indicates whether the residuals are normally distributed: if the residuals follow a straight line they are normally distributed, and if they deviate from the line they are not.

V. CONCLUSION

Research has shown that the academic performance of a student is affected by both academic and non-academic factors. The system was designed and developed to collect the prime academic and non-academic details from university-level students using automated forms. The collected data were then cleaned and pre-processed, and standard mining algorithms were applied. The algorithms considered are K-Means clustering and the Apriori algorithm. The mined results are displayed in both text form and graphical form. Using these results, the education system, comprising teachers, management, and students, can take measures to enhance the teaching-learning process. As a result, educational performance can be improved. This work can be further enhanced to predict the learning patterns of students and to identify exactly the factors that affect the performance of the students in their academic arena.

REFERENCES

[1] Camilo Ernesto Lopez Guarín, Elizabeth León Guzman and Fabio A.
Gonzalez, "A Model to Predict Low Academic Performance at a Specific Enrollment Using Data Mining", IEEE Journal of Latin-American Learning Technologies, vol. 10, no. 3.
[2] Shaymaa E. Sorour, Tsunenori Mine, Kazumasa Goda and Sachio Hirokawa, "Comments Data Mining for Evaluating Students Performance", International Conference on Advanced Applied Informatics, pp. 25-3.
[3] Nidyanandan Pratheesh and Devi Thirupathi, "Sensation of Learning Analytics to Prevail the Software Engineering Education", International Conference on Advanced Computing and Communication Systems, pp. 1-7.
[4] Xin Li, Xuehui Zhang and Xin Liu, "A Visual Analytics Approach for E-learning Education", International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, pp. 34-40.
[5] Ashkan Sharabiani, Fazle Karim, Anooshiravan Sharabiani, Mariya Atanasov and Houshang Darabi, "An Enhanced Bayesian Network Model for Prediction of Students Academic Performance in Engineering Programs", Global Engineering Education Conference.
[6] Shaymaa E. Sorour, Jingyi Lu, Kazumasa Goda and Tsunenori Mine, "Correlation of Grade Prediction Performance with Characteristics of Lesson Subject", International Conference on Advanced Learning Technologies.