Asian-African Journal of Economics and Econometrics, Vol. 13, No. 2, 2013: 303-308

STOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL

Jyoti Badge *

ABSTRACT

Stock market data is highly chaotic and contains a large amount of unwanted data. Detecting and removing outliers is an important problem in stock market analysis. If outliers are present in the data, they give misleading results and degrade prediction performance. In this paper, K-means clustering is applied to cluster the stock market data, and Euclidean distances are then calculated to detect the outliers. Future stock market trends are predicted by applying the ARIMA time series forecasting model to the clustered data. The results are encouraging, and the predictions are good for estimating future trends.

Keywords: Outliers, K-means Clustering, Euclidean distance, ARIMA Model.

1. INTRODUCTION

Stock market forecasting is a difficult task because of the market's dynamic nature. Even professional stock market investors are not in a position to predict the market trend, as a number of factors influence it. It is not easy to predict the market by considering simple factors such as past performance, the earning assets of the company and the overall trend in the sector. Stock market data may contain outliers, which affect forecasting. Outliers are data that are inconsistent with the remaining data in a set [1, 3]. Identifying outliers in a data set is not simple. Many data mining algorithms have been designed to minimize the influence of outliers or to eliminate them altogether. In this paper, K-means clustering is applied to group the data and the Euclidean distance is used to find the outliers in the data.

2. METHODOLOGY

The stock market data passes through a multi-step process for forecasting stock market trends.
The steps are: (a) normalization of the stock market data, (b) formation of clusters using K-means clustering, (c) finding the outliers using the Euclidean distance within each cluster, and (d) applying ARIMA to the clustered data. The attributes open price, high price, low price, close price and trading volume are used in the model.

The data is normalized using z normalization, which is defined as

\[ z = \frac{Y - M}{S} \]

where \(M\) is the mean and \(S\) is the standard deviation of \(Y\) [4]. That is, the original unit of measurement of \(Y\) is eliminated by subtracting the mean and dividing by the standard deviation, which normalizes the financial data attributes.

* Assistant Professor, IPER, Bhopal, E-mail: jyoti.badge@gmail.com

Clusters are formed using K-means clustering [4, 5]. K-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. Mathematically, K-means clustering minimizes the objective function

\[ J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \tag{1} \]

where \(\left\| x_i^{(j)} - c_j \right\|^2\) is a chosen distance measure between a data point \(x_i^{(j)}\) and the cluster centre \(c_j\), and \(J\) is an indicator of the distance of the n data points from their respective cluster centres.

Some of the outliers in each group are removed from the data set before the forecasting method is applied. The K-means algorithm is based on the Euclidean distance between two data points. The distance of the data points in each cluster is measured from the centroid of that cluster. The Euclidean distance is the ordinary distance between two points [4]. The Euclidean distance between points \(P = (p_1, p_2, \ldots, p_n)\) and \(Q = (q_1, q_2, \ldots, q_n)\) in Euclidean n-space is defined as

\[ d(P, Q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{2} \]

In this paper, the closing price is predicted using an ARIMA model [2, 7]. The ARIMA model is very flexible and widely used in stock market forecasting. It combines autoregression, differencing of the series and a moving average. Each of the three process types has its own characteristic way of responding to a random disturbance. The resulting equation has the form

\[ z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q} \tag{3} \]

where \(\phi_i\), \(i = 1, 2, \ldots, p\), are the autoregressive parameters; \(\theta_j\), \(j = 1, 2, \ldots, q\), are the moving average parameters; \(z_t\) is obtained by differencing the original time series d times; \(a_t\) is the white noise component at time t; p is the order of the autoregressive part; q is the order of the moving average part; and d is the order of differencing. The aim of this method is to obtain the best estimates of the parameters.

3. EMPIRICAL RESULT

We performed this experiment using randomly selected data from a heavy electrical company, covering the period from 1 January 2006 to 31 December 2007. The financial attributes considered for prediction are open, high, low, close and volume. The stock market data is normalized using z-values. Figure 1 shows the un-normalized data and Figure 2 shows the normalized data. The normalized data is clustered using K-means clustering. The final cluster centers are found using SPSS. The result is shown in Table 1.
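As an illustration of the normalization and clustering steps, the following Python sketch z-normalizes each attribute, runs a plain K-means (Lloyd's algorithm), and computes the Euclidean distance of each case from its cluster centre. It is not the SPSS procedure used in the paper: the sample rows, the choice k = 2, the random seed and the use of the population standard deviation are all assumptions made for the example.

```python
import math
import random
import statistics


def z_normalize(column):
    """Return (Y - M) / S for every value Y in one attribute column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population standard deviation
    return [(y - mean) / std for y in column]


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: assign each point to the nearest centre, then update the means."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres are k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: euclidean(p, centres[j])) for p in points]
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append([statistics.fmean(dim) for dim in zip(*members)])
            else:
                new_centres.append(centres[j])  # leave an empty cluster's centre unchanged
        if new_centres == centres:  # converged
            break
        centres = new_centres
    return labels, centres


# Hypothetical rows of (open, high, low, close, volume); not the paper's data.
rows = [
    (100.0, 102.0, 99.0, 101.0, 5000.0),
    (101.0, 103.0, 100.0, 102.0, 5200.0),
    (102.0, 104.0, 101.0, 103.0, 5100.0),
    (103.0, 105.0, 102.0, 104.0, 5300.0),
    (150.0, 152.0, 149.0, 151.0, 9000.0),
    (151.0, 153.0, 150.0, 152.0, 9100.0),
    (152.0, 154.0, 151.0, 153.0, 8900.0),
    (153.0, 160.0, 140.0, 145.0, 20000.0),  # an unusual trading day
]

# Normalize column by column, then rebuild one point per trading day.
columns = [z_normalize(list(col)) for col in zip(*rows)]
points = [list(p) for p in zip(*columns)]

labels, centres = k_means(points, k=2)

# Distance of each case from the centre of the cluster it was assigned to.
distances = [euclidean(p, centres[lab]) for p, lab in zip(points, labels)]
for index, (lab, dist) in enumerate(zip(labels, distances), start=1):
    print(f"case {index}: cluster {lab}, distance from centre {dist:.3f}")
```

Cases whose distance is large relative to the other cases in the same cluster are the candidates for removal as outliers.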
Figure 1: Q-Q Plot of Original Close Price

Figure 2: Q-Q Plot of Normalized Close Price
Table 1
Final Cluster Centers

Cluster            1         2         3         4         5
Open Price      -.607      .490      1.47     -.120     -1.43
High Price      -.625      .459      1.49    -.0819     -1.43
Low Price       -.615      .507      1.45     -.123     -1.43
Close Price     -.620      .479      1.47    -0.994     -1.43
Volume          -.740     -.799      .366      .857      .546

Euclidean distances were computed between the final cluster centers. Greater distances between clusters correspond to greater dissimilarities. Table 2 shows the distances between the final cluster centers.

Table 2
Distances between Final Cluster Centers

Cluster        1         2         3         4         5
1              -     2.203     4.330     1.898     2.086
2          2.203         -     2.301     2.036     4.072
3          4.330     2.301         -     3.202     5.830
4          1.898     2.036     3.202         -     2.681
5          2.086     4.072     5.830     2.681         -

The presence of outliers, that is, unusually large or small values in the data, can affect the clustering of observations. The clusters are typically larger when outliers are not removed, and the resulting solution is less clear. For further analysis, outliers are removed from the final clusters. Figure 3 shows a diagnostic plot of the cluster number of each case against the distance of the case from its classification cluster centre. This graph helps to find outliers within clusters. After the outliers have been found, the ARIMA time series model is applied to each cluster. Table 3 shows the actual and predicted close prices with and without removal of outliers. When outliers are not removed from the data, the mean absolute error and mean absolute percentage error are 9.46 and 0.40%; when outliers are removed, they are 2.32 and 0.098%. The empirical results show that removing outliers from the financial data set gives better results than leaving them in.
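The forecasting step can be illustrated with a deliberately small case. The paper does not report the orders (p, d, q) it used, so the sketch below assumes ARIMA(1, 1, 0): the series is differenced once, a single autoregressive parameter is estimated by least squares, and there is no moving average term. The closing prices are hypothetical, and a statistical package would normally estimate the parameters by maximum likelihood instead.

```python
def difference(series, d=1):
    """Difference the series d times (the integrated part of ARIMA)."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series


def fit_ar1(z):
    """Least-squares estimate of phi in z_t = phi * z_{t-1} + a_t (no constant term)."""
    numerator = sum(prev * cur for prev, cur in zip(z, z[1:]))
    denominator = sum(prev * prev for prev in z[:-1])
    return numerator / denominator


def forecast_next(series):
    """One-step-ahead forecast from an ARIMA(1, 1, 0) model fitted to the series."""
    z = difference(series, d=1)
    phi = fit_ar1(z)
    # Forecast the next difference, then add it back to the last observed level.
    return series[-1] + phi * z[-1]


# Hypothetical closing prices whose first differences (8, 4, 2, 1) halve at every step.
closes = [100.0, 108.0, 112.0, 114.0, 115.0]

phi = fit_ar1(difference(closes))
print(f"estimated phi = {phi}")                      # 0.5 for this series
print(f"next close forecast = {forecast_next(closes)}")  # 115.5 for this series
```

In the setting of the paper, a model of this kind would be fitted separately to the close prices of each cluster, once with and once without the cases flagged as outliers.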
Figure 3: Box Plot between Cluster Number of Cases and Distance of Case from its Classification Cluster Centre

Table 3
Actual and Predicted Close Price when Removing and not Removing Outliers from the Data
(Columns headed "Not removed" and "Removed" refer to outliers.)

      Predicted close price   Actual       Variation of close price  Percentage of error
S.No  Not removed  Removed    close price  Not removed  Removed      Not removed  Removed
1     2293.02      2289.67    2283.75      -9.27        -5.92        0.41         0.25
2     2377.06      2353.65    2355.55      -21.51       1.9          0.91         0.08
3     2352.76      2360.59    2363.75      10.99        3.16         0.46         0.13
4     2378.00      2391.51    2392.45      14.45        0.94         0.60         0.03
5     2255.54      2247.38    2249.5       -6.04        2.12         0.27         0.09
6     2289.40      2288.58    2288.25      -1.15        -0.33        0.05         0.01
7     2382.69      2380.13    2379.7       -2.99        -0.43        0.13         0.01
8     2349.38      2335.78    2337.65      -11.73       1.87         0.50         0.07
9     2464.22      2455.90    2450.35      -13.87       -5.55        0.57         0.22
10    2459.78      2463.39    2462.35      2.57         -1.04        0.10         0.04
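The error measures quoted in the text can be recomputed from Table 3. The sketch below copies the predicted and actual close prices from the table and applies the usual definitions of mean absolute error and mean absolute percentage error; the function and variable names are chosen for this example only.

```python
# Rows of Table 3: (predicted, outliers not removed), (predicted, outliers removed), actual.
table3 = [
    (2293.02, 2289.67, 2283.75),
    (2377.06, 2353.65, 2355.55),
    (2352.76, 2360.59, 2363.75),
    (2378.00, 2391.51, 2392.45),
    (2255.54, 2247.38, 2249.50),
    (2289.40, 2288.58, 2288.25),
    (2382.69, 2380.13, 2379.70),
    (2349.38, 2335.78, 2337.65),
    (2464.22, 2455.90, 2450.35),
    (2459.78, 2463.39, 2462.35),
]


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all cases."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / actual, expressed in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


pred_not_removed, pred_removed, actual = zip(*table3)

mae_not_removed = mean_absolute_error(actual, pred_not_removed)
mae_removed = mean_absolute_error(actual, pred_removed)
mape_not_removed = mean_absolute_percentage_error(actual, pred_not_removed)
mape_removed = mean_absolute_percentage_error(actual, pred_removed)

print(f"outliers not removed: MAE = {mae_not_removed:.2f}, MAPE = {mape_not_removed:.3f}%")
print(f"outliers removed:     MAE = {mae_removed:.2f}, MAPE = {mape_removed:.3f}%")
```

Up to rounding, the computed values agree with the reported 9.46 and 0.40% when outliers are not removed, and 2.32 and 0.098% when they are removed.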
4. CONCLUSION

In this paper we detect and reduce the influence of outliers in stock market data. The experimental results indicate that prediction improves when outliers are removed from the data set: the mean absolute error falls from 9.46 to 2.32 and the mean absolute percentage error from 0.40% to 0.098%. K-means clustering is an effective tool that helps to group the data into similar patterns, and the Euclidean distance helps to find the outliers within the clusters. To obtain better forecasting results, one should therefore reduce the effect of outliers in the financial data attributes.

REFERENCES

[1] Barnett, V. and T. Lewis (1994), Outliers in Statistical Data, John Wiley.
[2] Box, G. E. P. and G. M. Jenkins (1976), Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco.
[3] Hawkins, D. (1980), Identification of Outliers, Chapman and Hall, London.
[4] Han, J. and M. Kamber (2001), Data Mining: Concepts and Techniques, Academic Press, Morgan Kaufmann Publishers.
[5] Kaufman, L. and P. Rousseeuw (1990), Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons.
[6] Giudici, P. (2003), Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England.
[7] Vandaele, W. (1983), Applied Time Series and Box-Jenkins Models, Academic Press.