STOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL




Asian-African Journal of Economics and Econometrics, Vol. 13, No. 2, 2013: 303-308

Jyoti Badge*

ABSTRACT

Stock market data is highly chaotic and contains a large amount of unwanted data. Detecting and removing outliers is therefore an important problem in stock market analysis. Outliers left in the data give misleading results and degrade prediction performance. In this paper, K-means clustering is applied to cluster the stock market data, and Euclidean distances are then calculated to detect the outliers. The future trend of the stock market is predicted by applying an ARIMA time series forecasting model to the clustered data. The results are encouraging, and the predictions are good for estimating future trends.

Keywords: Outliers, K-means Clustering, Euclidean distance, ARIMA Model.

1. INTRODUCTION

Stock market forecasting is a difficult task because of the market's dynamic nature. Even professional stock market investors are not in a position to predict the market trend, because a large number of factors influence it. It is not easy to predict the market by considering simple factors such as past performance, the earnings and assets of the company, and the overall trend in its sector. Stock market data may contain outliers, which affect forecasting. Outliers are data that are inconsistent with the remaining data in a set [1, 3]. Identifying outliers in a data set is not simple. Many data mining algorithms have been designed to minimize the influence of outliers or to eliminate them altogether. In this paper, K-means clustering is applied to group the data, and the Euclidean distance is used to find the outliers in the data.

2. METHODOLOGY

The stock market data passes through a multi-step process for forecasting stock market trends.
The process has four steps: (a) normalization of the stock market data, (b) formation of clusters using K-means clustering, (c) detection of outliers using the Euclidean distance within each cluster, and (d) application of ARIMA to the clustered data. The attributes open price, high price, low price, close price and trading volume are used in the model.

The data is normalized using z normalization, which is defined as [4]

$$ z = \frac{Y - M}{S_Y} $$

where $M$ is the mean and $S_Y$ is the standard deviation of the attribute $Y$. Subtracting the mean and dividing by the standard deviation removes the original unit of measurement of $Y$ and normalizes the financial data attributes.

* Assistant Professor, IPER, Bhopal, E-mail: jyoti.badge@gmail.com

Clusters are formed using K-means clustering [4, 5]. K-means clustering is a method of cluster analysis that aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean. Mathematically, K-means clustering minimizes the objective function

$$ J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1) $$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the distance of the n data points from their respective cluster centres. Some of the outliers in each cluster are removed from the data set before the forecasting method is applied.

The K-means algorithm is based on the Euclidean distance between two data points. The distance of each data point within a cluster is measured from the centroid of that cluster. The Euclidean distance is the ordinary distance between two points [4]. The Euclidean distance between points $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ in Euclidean n-space is defined as:

$$ d(P, Q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (2) $$

In this paper the closing price is predicted using an ARIMA model [2, 7]. The ARIMA model is very flexible and widely used in stock market forecasting. It combines autoregression, differencing of the series and a moving average. Each of the three process types has its own characteristic way of responding to a random disturbance. The resulting equation has the form

$$ Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q} \qquad (3) $$

where $\phi_i$, $i = 1, 2, \ldots, p$ are the autoregressive parameters, $\theta_j$, $j = 1, 2, \ldots, q$ are the moving average parameters, $Z_t$ is obtained by differencing the original time series d times, $a_t$ is the white noise component at time t, p is the order of the autoregressive part, q is the order of the moving average part, and d is the order of differencing. The aim of this method is to obtain the best estimates of the parameters.

3. EMPIRICAL RESULT

We performed this experiment using randomly selected data from a heavy electrical company, covering the period from 1 January 2006 to 31 December 2007. The financial attributes considered for prediction are open, high, low, close and volume. The stock market data is normalized using z-values. Figure 1 shows the un-normalized data and Figure 2 shows the normalized data. The normalized data is clustered using K-means clustering. The final cluster centers are found by SPSS. The result is shown in Table 1.
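The normalization, clustering and outlier-detection steps described in Section 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the SPSS procedure used in the paper: the population standard deviation, the deterministic choice of initial centres, the box-plot fence (third quartile plus 1.5 times the interquartile range) used as the outlier rule, and the sample rows are all hypothetical choices made for the example.

```python
import math
from statistics import mean, pstdev, quantiles


def z_normalize(column):
    """z = (Y - M) / S for one attribute column (population std is an assumption)."""
    m, s = mean(column), pstdev(column)
    return [(y - m) / s for y in column]


def euclidean(p, q):
    """Ordinary Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(point, centres):
    """Index of the centre closest to the given point."""
    return min(range(len(centres)), key=lambda j: euclidean(point, centres[j]))


def k_means(points, k, iterations=100):
    """Basic K-means; initial centres are evenly spaced rows (a hypothetical choice)."""
    n = len(points)
    centres = [list(points[i * n // k]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centre.
        labels = [nearest(p, centres) for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = [mean(dim) for dim in zip(*members)]
    labels = [nearest(p, centres) for p in points]
    return centres, labels


def flag_outliers(points, centres, labels):
    """Flag points beyond the box-plot upper fence of their own cluster."""
    dist = [euclidean(p, centres[lab]) for p, lab in zip(points, labels)]
    flags = [False] * len(points)
    for j in range(len(centres)):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if len(idx) < 4:  # too few cases to estimate quartiles
            continue
        q1, _, q3 = quantiles([dist[i] for i in idx], n=4)
        fence = q3 + 1.5 * (q3 - q1)
        for i in idx:
            flags[i] = dist[i] > fence
    return flags


# Hypothetical (close price, volume) rows, made up for illustration only.
rows = [
    (2250.0, 100000.0), (2255.0, 110000.0), (2248.0, 90000.0), (2252.0, 105000.0),
    (2450.0, 300000.0), (2455.0, 310000.0), (2448.0, 290000.0), (2452.0, 305000.0),
    (2451.0, 295000.0), (2449.0, 302000.0), (2454.0, 298000.0), (2453.0, 900000.0),
]
columns = [z_normalize(col) for col in zip(*rows)]  # step (a)
points = list(zip(*columns))
centres, labels = k_means(points, k=2)              # step (b)
flags = flag_outliers(points, centres, labels)      # step (c)
for row, lab, flag in zip(rows, labels, flags):
    print(row, "cluster", lab, "outlier" if flag else "")
```

Rows flagged in step (c) would be dropped before the forecasting model of step (d) is fitted to the remaining data.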

Figure 1: Q-Q Plot of Original Close Price

Figure 2: Q-Q Plot of Normalized Close Price

Table 1
Final Cluster Centers

Cluster              1        2        3        4        5
Open Price       -.607     .490     1.47    -.120    -1.43
High Price       -.625     .459     1.49   -.0819    -1.43
Low Price        -.615     .507     1.45    -.123    -1.43
Close Price      -.620     .479     1.47   -0.994    -1.43
Volume           -.740    -.799     .366     .857     .546

Euclidean distances were found between the final cluster centers. Greater distances between clusters correspond to greater dissimilarities. Table 2 shows the distances between the final cluster centres.

Table 2
Distances between Final Cluster Centers

Cluster        1        2        3        4        5
1              -    2.203    4.330    1.898    2.086
2          2.203        -    2.301    2.036    4.072
3          4.330    2.301        -    3.202    5.830
4          1.898    2.036    3.202        -    2.681
5          2.086    4.072    5.830    2.681        -

The presence of outliers, that is, unusually large or small values in the data, can affect the clustering of observations. The clusters are typically larger when outliers are not removed, and the resulting solution is more confused. For further analysis, outliers are removed from the final clusters. Figure 3 shows a diagnostic box plot of the distance of each case from its classification cluster centre, grouped by cluster number. This graph helps to find outliers within clusters. After the outliers are identified, the ARIMA time series model is applied to each cluster. Table 3 shows the actual and predicted close prices with and without removal of outliers. When outliers are not removed from the data, the mean absolute error is 9.46 and the mean absolute percentage error is 0.40%. When outliers are removed, the mean absolute error is 2.32 and the mean absolute percentage error is 0.098%. The empirical results show that removing outliers from the financial data set gives better results than leaving them in.
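As a rough check, the distances in Table 2 can be recomputed from the centres in Table 1 with the Euclidean distance of equation (2). The sketch below assumes the attribute order (open, high, low, close, volume) and uses the centres exactly as printed, so small differences from Table 2 are to be expected, presumably because the printed centres are rounded. Cluster 4 is left out because its printed close-price centre does not reproduce the Table 2 distances as closely.

```python
import math

# Final cluster centres as printed in Table 1, in the assumed order
# (open, high, low, close, volume). Cluster 4 is omitted from this sketch.
centres = {
    1: (-0.607, -0.625, -0.615, -0.620, -0.740),
    2: (0.490, 0.459, 0.507, 0.479, -0.799),
    3: (1.47, 1.49, 1.45, 1.47, 0.366),
    5: (-1.43, -1.43, -1.43, -1.43, 0.546),
}


def euclidean(p, q):
    """Euclidean distance between two centres."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Pairwise distances between every two distinct clusters.
distances = {
    (i, j): euclidean(centres[i], centres[j])
    for i in centres
    for j in centres
    if i < j
}

for (i, j), d in sorted(distances.items()):
    print(f"clusters {i} and {j}: {d:.3f}")
```

With these inputs the recomputed values agree with Table 2 to within a few hundredths.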

Figure 3: Box Plot between Cluster Number of Cases and Distance of Case from its Classification Cluster Centre

Table 3
Actual and Predicted Close Price with and without Removal of Outliers from the Data

Column key:
  P1 = predicted close price when outliers are not removed from the data
  P2 = predicted close price when outliers are removed from the data
  A  = actual close price
  V1 = variation of close price when outliers are not removed
  V2 = variation of close price when outliers are removed
  E1 = percentage of error when outliers are not removed
  E2 = percentage of error when outliers are removed

S.No        P1         P2          A        V1       V2      E1      E2
1      2293.02    2289.67    2283.75     -9.27    -5.92    0.41    0.25
2      2377.06    2353.65    2355.55    -21.51     1.9     0.91    0.08
3      2352.76    2360.59    2363.75     10.99     3.16    0.46    0.13
4      2378.00    2391.51    2392.45     14.45     0.94    0.60    0.03
5      2255.54    2247.38    2249.5      -6.04     2.12    0.27    0.09
6      2289.40    2288.58    2288.25     -1.15    -0.33    0.05    0.01
7      2382.69    2380.13    2379.7      -2.99    -0.43    0.13    0.01
8      2349.38    2335.78    2337.65    -11.73     1.87    0.50    0.07
9      2464.22    2455.90    2450.35    -13.87    -5.55    0.57    0.22
10     2459.78    2463.39    2462.35      2.57    -1.04    0.10    0.04
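The error measures quoted in the text can be recomputed from the predicted and actual prices in Table 3. The sketch below assumes the usual definitions, which the paper does not spell out: mean absolute error as the average of |actual - predicted|, and mean absolute percentage error as the average of |actual - predicted| / actual, expressed in percent.

```python
# Each row: (predicted with outliers kept, predicted with outliers removed, actual),
# copied from Table 3.
rows = [
    (2293.02, 2289.67, 2283.75),
    (2377.06, 2353.65, 2355.55),
    (2352.76, 2360.59, 2363.75),
    (2378.00, 2391.51, 2392.45),
    (2255.54, 2247.38, 2249.5),
    (2289.40, 2288.58, 2288.25),
    (2382.69, 2380.13, 2379.7),
    (2349.38, 2335.78, 2337.65),
    (2464.22, 2455.90, 2450.35),
    (2459.78, 2463.39, 2462.35),
]


def mae(predicted, actual):
    """Mean absolute error."""
    return sum(abs(a - p) for p, a in zip(predicted, actual)) / len(actual)


def mape(predicted, actual):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for p, a in zip(predicted, actual)) / len(actual)


kept = [r[0] for r in rows]
removed = [r[1] for r in rows]
actual = [r[2] for r in rows]

mae_kept, mape_kept = mae(kept, actual), mape(kept, actual)
mae_removed, mape_removed = mae(removed, actual), mape(removed, actual)

print(f"outliers kept:    MAE = {mae_kept:.3f}, MAPE = {mape_kept:.4f}%")
print(f"outliers removed: MAE = {mae_removed:.3f}, MAPE = {mape_removed:.4f}%")
```

Under these definitions the results come out close to the reported 9.46 and 0.40% with outliers kept, and 2.32 and 0.098% with outliers removed.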

4. CONCLUSION

In this paper we detect and reduce the influence of outliers in stock market data. The experimental results indicate that prediction improves when outliers are removed from the data set: both the mean absolute error and the mean absolute percentage error are markedly lower. K-means clustering is an effective tool for grouping the data into similar patterns, and the Euclidean distance helps to find the outliers within the clusters. To obtain better forecasting results, one should therefore reduce the effect of outliers on the financial data attributes.

References

[1] Barnett, V. and T. Lewis (1994), Outliers in Statistical Data. John Wiley.
[2] Box, G. and G. Jenkins (1976), Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
[3] Hawkins, D. (1980), Identification of Outliers. Chapman and Hall, London.
[4] Han, J. and M. Kamber (2001), Data Mining: Concepts and Techniques. Academic Press, Morgan Kaufmann Publishers.
[5] Kaufman, L. and P. Rousseeuw (1990), Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
[6] Giudici, P. (2003), Applied Data Mining: Statistical Methods for Business and Industry. John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England.
[7] Vandaele, W. (1983), Applied Time Series and Box-Jenkins Models. Academic Press.