International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8 August 2013


A Short-Term Traffic Prediction on a Distributed Network Using Multiple Regression Equation

Ms. Sharmi S., Research Scholar, MS University, Thirunelvelli
Dr. M. Punithavalli, Director, SREC, Coimbatore

Abstract: A new approach is proposed to predict the fractal behavior of distributed network traffic, in which a random scaling fractal model is used to simulate the self-affine characteristics of the traffic. The network traffic is studied by sniffing a portion of it with Wireshark. The sniffed traffic is inspected and dissected per protocol using the filter option. The fractal behavior of the traffic is captured and examined with this open-source network analyzer. The sniffed packet records are then exported to the NeuroSolutions builder and to SPSS and examined. Finally, the exported and dissected traffic data is fed as input to train a neural network that predicts the resulting fractal behavior of the distributed network traffic, and an equation is derived in SPSS that gives a close prediction of the network traffic.

Keywords: fractal behavior, sniffing, predict, SPSS, NeuroSolution builder, NeuroXL predictor.

I INTRODUCTION

For the examination of local problems in a small network, monitoring at a single observation point is sufficient to train the network builder. In such cases, the network analyzer can be a machine running Wireshark that is directly connected to a network segment or to the monitoring port of a switch or router. In larger networks, it is often necessary to monitor simultaneously at multiple observation points in order to train the constructed neural network more efficiently. In this research, a neural network (multilayer perceptron) is proposed to predict the dependent variable values over different independent variable value distributions using two specific modeling tools, SPSS and NeuroSolutions.
One objective of this work is to find how the distribution of the dependent variable values in the dataset affects the neural network's prediction performance when different modeling tools are used. A second objective is to compare the performance of the two modeling tools in predicting the dependent variable values.

Analyzing packet records with Wireshark

Wireshark [1], formerly known as Ethereal, is probably the most popular open-source network analyzer. For the experiments, we configured Wireshark on our machine to capture network packets. The collected data is exported in comma-separated value (.csv) format. Wireshark can be divided into four main modules: Capture Core, WireTap, Protocol Interpreter and Dissector. Capture Core uses the common library WinPcap to capture data from different network types (Ethernet, Ring, etc.). Once the data is obtained, WireTap saves it as a binary file. Because the data is binary, the user cannot understand it without the Protocol Interpreter and Dissector. A Dissector can be available in built-in or plug-in mode. The proposed approach profits from Wireshark's extensive packet inspection and protocol dissection capabilities for distributed network analysis.

NeuroSolutions

The NeuralBuilder helps to construct the neural network by selecting parameters. The four problem types currently available in the NeuralExpert are Classification, Prediction, Function Approximation, and Clustering. A parameter list is then selected to train the neural network, and the desired traffic is specified as the output used for training.

ISSN: Page 2452
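As a rough illustration of how a Wireshark CSV export could be turned into per-protocol traffic series, the sketch below sums packet lengths into fixed time slices. It assumes the export carries Wireshark's default column headers (No., Time, Source, Destination, Protocol, Length, Info). The sample rows, the one-second slice width and the function name `traffic_per_slice` are hypothetical and are not taken from the paper.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample shaped like a Wireshark CSV export (default columns assumed).
SAMPLE_CSV = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","10.0.0.1","10.0.0.2","TCP","60","SYN"
"2","0.400000","10.0.0.2","10.0.0.1","TCP","60","SYN, ACK"
"3","0.900000","10.0.0.1","10.0.0.3","UDP","120","Query"
"4","1.200000","10.0.0.1","10.0.0.2","TCP","1500","Data"
"5","2.700000","10.0.0.3","10.0.0.1","UDP","200","Response"
'''


def traffic_per_slice(csv_text, slice_seconds=1.0):
    """Return {protocol: {slice_index: total_bytes}} for the given CSV text."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "Time" is assumed to be seconds since the start of the capture.
        slice_index = int(float(row["Time"]) // slice_seconds)
        totals[row["Protocol"]][slice_index] += int(row["Length"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {proto: dict(slices) for proto, slices in totals.items()}


series = traffic_per_slice(SAMPLE_CSV)
for proto in sorted(series):
    print(proto, series[proto])
```

Each per-protocol series produced this way could serve as one independent variable, with the total traffic per slice as the dependent variable.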
Figure 1. Flow diagram to deploy traffic prediction using ANN.

An ANN is a computational method motivated by biological models. ANNs attempt to mimic the fundamental operation of the human brain and can be used to solve a broad variety of problems [10]. One of the most important features of ANNs is that they can discover hidden patterns in data sets [11] and solve complex problems, especially when a mathematical model does not exist or is not suitable for the case at hand. Furthermore, ANNs are commonly robust to noise and irregularities present in the data [12, 13]. ANN learning is typically based on two data sets: the training set and the validation set. The training set, as its name indicates, is used to train a new artificial neural network. The validation set is used after the neural network has been trained to assess its performance. In most cases the validation set is similar to the training set but not identical [14, 15].

Data mapping

In artificial intelligence, a desired output is commonly known as the target. For the specific case of ANNs, the target is used for network training [9]. ANNs can map a given input to a desired output; an ANN used for this purpose is typically called a mapping ANN. The network is trained by applying the desired input to the ANN and then monitoring the actual ANN output. The difference between the actual ANN output and the desired output is normally used to drive the learning process. During training, the learning algorithm attempts to reduce the error measured between the actual network output and the target in the training set [9, 11]. The training process may be time-consuming, but when it has been successfully completed, an ANN can quickly calculate its output once the input data has been applied to the network input.

Figure 2: Flow diagram of ANN using NeuroSolutions. [Flowchart nodes: Start; dissect the network traffic dataset and enlist the traffic data parameters; translate the network traffic data parameters; train the NN's architecture for N epochs; evaluate the performance; criteria satisfied? (Y/N); extract a new dissected traffic dataset; perform prediction of the original expected traffic; Stop.]

Data classification

Data classification, or simply classification, is the process of identifying an object from a set of possible outcomes [9, 12]. An ANN can be trained to identify and classify any kind of object, such as numbers, images, sounds or signals. An ANN used for this purpose is also known as a classifier.

Figure 3. Training fractal dataset graph
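The training principle described under "Data mapping" (use the difference between actual output and target to adjust the network, then assess it on a separate validation set) can be sketched as follows. This is a deliberately simplified stand-in: a single linear neuron trained with the delta rule, not the multilayer perceptron built in NeuroSolutions. The data, learning rate, epoch count and function names are hypothetical.

```python
import random


def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs plus bias (linear activation)."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def train_linear_neuron(samples, epochs=200, lr=0.05, seed=0):
    """Delta-rule training; samples is a list of (inputs, target) pairs."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in samples[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Error = desired output (target) minus actual output.
            error = target - neuron_output(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


def mean_squared_error(samples, weights, bias):
    """Average squared error of the neuron over a data set."""
    squared = [(t - neuron_output(weights, bias, xs)) ** 2 for xs, t in samples]
    return sum(squared) / len(squared)


# Hypothetical noise-free data: target = 0.5*x1 + 0.25*x2, inputs in [0, 1).
rng = random.Random(1)
data = []
for _ in range(60):
    x1, x2 = rng.random(), rng.random()
    data.append(([x1, x2], 0.5 * x1 + 0.25 * x2))

# Similar but not identical sets, as described in the text.
training_set, validation_set = data[:40], data[40:]

weights, bias = train_linear_neuron(training_set)
validation_mse = mean_squared_error(validation_set, weights, bias)
print("weights:", weights, "bias:", bias)
print("validation MSE:", validation_mse)
```

In the spirit of Figure 2, the validation error would be compared against an acceptance criterion; if the criterion is not met, training is repeated, possibly on a newly extracted dataset.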
The network is initially trained with a network traffic dataset downloaded as a pcap file from the Wireshark sample captures, and the data is exported to the network builder for prediction. The predicted fractal behavior of the traffic data set is shown in Table 1.

II INVESTIGATION OF CORRELATION COEFFICIENT VALUE

This section investigates the effect of the dependent variable values and their distribution on the prediction accuracy rate. The results of the analyses let us find this effect and lead to an equation, generated with the modeling tool SPSS, that predicts the expected traffic from the distribution of the independent variable values. The correlation coefficient, R, is a measure of the strength of the association between the independent (explanatory) variables and the dependent (prediction) variable. R is never negative. This can be seen from the formula below, since the square root denotes the positive root [2, 3].

Formula for R for two independent variables, X1 and X2:

$$R = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\, r_{yx_1}\, r_{yx_2}\, r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}}$$

where y is the dependent variable and r_{ab} denotes the pairwise correlation between variables a and b.

The coefficient of multiple correlation estimates the combined influence of two or more variables on the observed (dependent) variable. To analyse the traffic data using multiple regression, the following assumptions must be verified [8]:
The dependent variable is measured on a continuous scale.
The two or more independent variables are continuous or categorical.
Observations should be recorded.
A linear relationship exists between the dependent variable and each of the independent variables.
The traffic data shows homoscedasticity, that is, the variances along the line of best fit remain similar as one moves along the line.
The data does not show multicollinearity, which occurs when two or more independent variables are highly correlated.
There are no significant outliers, high leverage points or highly influential points.
Residuals (errors) are approximately normally distributed.

The assumptions listed above are not violated, so the multiple correlation coefficient, R, is computed to measure the strength of the association between the independent (explanatory) variables and the single dependent (prediction) variable.

Multiple Regression-booster prediction phases:

In MRBooster, each feature of the association between the actual traffic and the dissected traffic is used explicitly to generate the prediction equation. The standard error factor, when examined further, offers a way to refine the regression equation that predicts the network traffic. The correlation structure of the traffic is thereby generated more easily.

Phase 1:
a. Plot the sniffed traffic data as a scatter plot to check visually for a possible linear relationship.
b. Calculate and interpret the linear correlation coefficient, using the data sets.
Phase 2:
c. Determine all possible regression equations for the data and refine them further by adjusting for the constant standard error.
d. Select and apply the best generated regression equation and forecast.
Phase 3:
e. Identify outliers and note the observations.
f. Process and interpret the performance of the R-booster prediction.

Table 1. Descriptive Statistics (SPSS)
Columns: Mean, Std. Deviation, N. Rows: ActualTraffic, Trafficn1, Trafficn2, Trafficn3.

Table 2. Correlation Coefficients (a: dependent variable is the actual traffic graph)
Columns: Model; Unstandardized Coefficients (R, Std. Error); Standardized Coefficients (Beta); T; Sig. Rows: 1 (Constant), Network1 (n1), Network2 (n2), Network3 (n3).

The following equation is generated to predict the actual traffic from the dissected protocol traffic:

$$\text{Predicted traffic (w.r.t. time slice)} = n_1\,(R_{n_1} - SE_{n_1}) + n_2\,(R_{n_2} - SE_{n_2}) + n_3\,(R_{n_3} - SE_{n_3}) + (R_{\text{constant}} - SE_{\text{constant}})$$

where R_x is the coefficient and SE_x the standard error of term x in Table 2.

Predicted traffic = Trafficn1 * 0.873 + Trafficn2 * 1.015 + Trafficn3 * …

The R values of the traffic from n1 and n2 show a strong association with the actual traffic, whereas the traffic from n3 shows a weak association; the interpretation scale is given in Table 3.

Table 3. R value strength.
R value    Interpretation
0.9        strong association
0.5        moderate association
0.25       weak association
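The regression step of Phases 1 to 3 can be sketched in plain Python as an ordinary least-squares fit with three predictors, followed by the multiple correlation coefficient R and a lookup against Table 3. This is only an illustration under stated assumptions: it is a standard least-squares fit and does not reproduce the standard-error adjustment of the proposed equation or the SPSS output. The traffic values are hypothetical, noise-free data, and treating the Table 3 values as lower thresholds is an assumption.

```python
import math


def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_multiple_regression(rows, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Evaluate b0 + b1*n1 + b2*n2 + b3*n3 for one time slice."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


def multiple_r(y, fitted):
    """Multiple correlation coefficient, sqrt(1 - SS_res / SS_tot)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return math.sqrt(max(0.0, 1.0 - ss_res / ss_tot))


def association_strength(r):
    """Assumed reading of Table 3: values are treated as lower thresholds."""
    if r >= 0.9:
        return "strong association"
    if r >= 0.5:
        return "moderate association"
    return "weak association"


# Hypothetical per-slice traffic from networks n1, n2, n3 and the actual traffic.
rows = [(10, 20, 5), (12, 18, 7), (15, 25, 6), (11, 22, 9),
        (14, 19, 4), (18, 24, 8), (16, 21, 3), (13, 27, 10)]
actual = [2.0 + 0.9 * n1 + 1.0 * n2 + 0.1 * n3 for n1, n2, n3 in rows]

coeffs = fit_multiple_regression(rows, actual)
fitted = [predict(coeffs, r) for r in rows]
r_value = multiple_r(actual, fitted)
print("coefficients:", coeffs)
print("R:", r_value, "->", association_strength(r_value))
```

With real capture data the fit would not be exact, and R would fall below 1 according to how well the three per-network series explain the actual traffic.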
Figure 4. Actual traffic vs. predicted traffic (NeuroSolutions) and computed traffic (SPSS)

Figure 4 shows that the traffic computed using the generated equation is very close to the actual target traffic.

III PERFORMANCE EVALUATION

The overall performance of the analyzed prediction methods is stated here to estimate the prediction accuracy. The coefficient of efficiency (E) is one such estimation method; it measures the performance and reveals the efficiency rate. The efficiency coefficient can take values in the domain (−∞, 1]. If E = 1, there is a perfect fit between the observed and the forecasted data. A value of E = 0 occurs when the prediction corresponds to estimating the mean of the actual values. An efficiency less than zero, i.e. −∞ < E < 0, indicates that the average of the actual values is a better predictor than the analyzed forecasting method. The closer E is to 1, the more accurate the prediction; the coefficient of efficiency stays at 0.9 for the forecasted traffic.

IV CONCLUSION

The experimental results demonstrate that 1) the regression model is more effective for traffic prediction; and 2) both the proposed prediction equation and the standard-error-based R (correlation coefficient) update scheme are effective in predicting the traffic in an easier way. The goal of the experiments was to evaluate and compare the performance of the ANN prediction approaches presented earlier in this paper. Hence, the linear regression model is a powerful tool for analyzing the association between one or more independent variables and a single dependent variable. Some novice researchers wish to move quickly beyond this model and learn to use more sophisticated models because they are discouraged by its limitations and believe that other regression models are more appropriate for their analysis needs.

References

[1] Wireshark Homepage, www.wireshark.org
[2] ClearSight Networks, Inc. Homepage
[3] /multipleregressionusingspssstatistics.php
[4] _ correlation
[5] %20Stats/Lecturestests/Test%202/Week12 assumptions.pdf
[6] WildPackets, Inc. Homepage, http://www.wildpackets.com
[7] S. Waldbusser, Remote Network Monitoring Management Information Base, RFC 2819 (Standard), May
[8] T. Masters, Practical Neural Network Recipes in C++. Preparing Input Data (C16), Academic Press, Inc., pp , (1993).
[9] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice-Hall of India, Second Edition. Statistical Learning Methods (C20), pp , (2006).
[10] T. Masters, Neural, Novel & Hybrid Algorithms for Time Series Prediction. Neural Network Tools (C10), John Wiley & Sons Inc., pp , (1995).
[11] T. Masters, Signal and Image Processing With Neural Networks. Data Preparation for Neural Networks (C3), John Wiley & Sons Inc., pp , (1994).
[12] T. Masters, Advanced Algorithms for Neural Networks. Assessing Generalization Ability (C9), John Wiley & Sons Inc., pp , (1995).
[13] R. D. Reed and R. J. Marks II, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks. Factors Influencing Generalization (C14), The MIT Press, pp , (1999).
[14] _Lab
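The coefficient of efficiency E used in Section III is not written out in the text. The sketch below assumes the common Nash-Sutcliffe form, E = 1 - Σ(actual - predicted)² / Σ(actual - mean of actual)², which matches the properties listed there (E = 1 for a perfect fit, E = 0 when the prediction equals the mean of the actual values, E < 0 when the mean is the better predictor). The traffic values and the function name are hypothetical.

```python
def coefficient_of_efficiency(actual, predicted):
    """Nash-Sutcliffe style efficiency: 1 - SS_err / SS_tot (assumed form)."""
    mean_actual = sum(actual) / len(actual)
    ss_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_err / ss_tot


# Hypothetical actual traffic per time slice.
actual = [10.0, 12.0, 15.0, 11.0, 14.0]

perfect = list(actual)                                   # forecast equals the observations
mean_only = [sum(actual) / len(actual)] * len(actual)    # forecast is the mean of the actual values
poor = [15.0, 10.0, 10.0, 15.0, 10.0]                    # forecast worse than the mean

e_perfect = coefficient_of_efficiency(actual, perfect)
e_mean = coefficient_of_efficiency(actual, mean_only)
e_poor = coefficient_of_efficiency(actual, poor)
print("perfect:", e_perfect, "mean only:", e_mean, "poor:", e_poor)
```

The three forecasts illustrate the three regimes described in Section III: a perfect fit, a forecast no better than the mean, and a forecast worse than the mean.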