A Web-based Interactive Data Visualization System for Outlier Subspace Analysis


Dong Liu, Qigang Gao
Computer Science, Dalhousie University
Halifax, NS, B3H 1W5, Canada

Hai Wang
Sobey School of Business, Saint Mary's University
Halifax, NS, B3H 3C3, Canada
hwang@smu.ca

Ji Zhang
Mathematics & Computing, University of Southern Queensland
Toowoomba, QLD, 4350, Australia
Ji.Zhang@usq.edu.au

Abstract

Detecting outliers in high-dimensional data is a challenging task, since outliers mainly reside in various low-dimensional subspaces of the data. To tackle this challenge, outlier detection approaches based on subspace analysis have been proposed recently. Detecting the outlying subspaces in which a given data point is an outlier gives a better characterization of outliers in high-dimensional data streams and makes outlier mining on large high-dimensional data sets more manageable. In this paper, to support outlier subspace analysis from the perspective of human perception, and thereby the development of efficient solutions for high-dimensional data, we propose a web-based interactive data visualization system. It displays various low-dimensional outlier subspaces so that users can observe and analyze the distributions of outliers. The tool helps developers of outlier detection applications examine the distributions of outliers in these subspaces directly and validate their experimental results.

1 Introduction

Outliers in a database or data stream are data objects that are grossly different from, or inconsistent with, the rest of the data, and that reflect abnormal behaviour in the real world. Outliers may stand for toxin spills in chemical sensor data, network intrusions in network log data, cancers in medical data, or simply errors or noise caused by human mistakes or sensor damage [11, 12, 13].
Outliers should be treated differently in different situations: outliers that are errors or noise should be removed, whereas outliers such as intrusions and cancers are the targets and should be detected for analysis and event prevention. In either situation, outliers must be detected and classified properly. Traditional outlier detection methods have mainly been designed to analyze the data in its full dimensionality. They work well for low-dimensional data sets. However, more and more real applications now involve high-dimensional data. Detecting outliers in high-dimensional data is challenging because traditional methods become infeasible due to the curse of dimensionality: outliers hidden in low-dimensional subsets of the data disappear as the dimensionality increases under full-dimensionality analysis [2]. The new strategy for high-dimensional data is to detect outliers in lower-dimensional subspaces of the data, as introduced in [1]. The idea is to convert the problem of outlier detection in the high-dimensional data space into the problem of detecting low-dimensional outlying subspaces, since an exhaustive search of all subspaces of a high-dimensional data space is intractable. In this paper, we propose a data visualization system that facilitates analysis and solution development for finding projected outlier subspaces, and that lets developers and users gain insight by observing the data distributions in various low-dimensional outlier subspaces of the data.

Visualization has proved to be a useful tool for data analysis. With the development of computer hardware and software, visualization techniques can use computer graphics to create visual images that aid the understanding of complex, often massive, representations of data. A number of visualization tools are available, such as SequoiaView [3], GGobi [6], OpenViz [7], VisuMap [8] and ADVIZOR [9]. Some tools are web-based, for the convenience of broad user groups, such as Manyeyes [4] and Drillet [5]. However, there is no data visualization system for directly analyzing projected outlier subspaces. In this paper, we present a visualization system for outlier subspace analysis whose features and interface tools are specially designed to help people observe and explore large volumes of high-dimensional data and gain insight into outlier detection on such complex data sets.

2 System Design and Implementation

The proposed visualization system is designed to support outlier analysis on high-dimensional data, where human perception can play a role in gaining insight into outlier subspaces. It is based on the concept of the Stream Projected Outlier Detector (SPOT) [1]. In the SPOT system, the problem of detecting projected outliers from high-dimensional data streams is formulated as follows. Given a data stream $D$ with a potentially unbounded number of $\varphi$-dimensional data points, each data point $p_i = \{p_{i1}, p_{i2}, \ldots, p_{i\varphi}\}$ in $D$ is labeled as either a projected outlier or a regular data point. If $p_i$ is a projected outlier, its associated outlying subspace(s) are given as well. The results returned are a set of projected outliers and their associated outlying subspace(s), which indicate the context in which these projected outliers exist.
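
As an illustration of the output just described, the projected outliers and their outlying subspaces can be held in a simple map from a data point's index to the subspaces in which it is outlying. The Java sketch below is only an assumption about one possible in-memory representation; the class ProjectedOutlierResult and its methods are hypothetical names and are not taken from SPOT [1]. The two entries in main reuse the sample detections listed in Section 2.2 (data #1 in subspace 11, data #2 in subspace 1 6).

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical container: each outlier index maps to its outlying subspaces. */
final class ProjectedOutlierResult {
    // Key: index of a data point labeled as a projected outlier.
    // Value: its outlying subspaces, each kept as a sorted set of dimension indices.
    private final Map<Integer, Set<SortedSet<Integer>>> subspacesByOutlier = new TreeMap<>();

    /** Records that the given point is an outlier in the subspace spanned by the dimensions. */
    void add(int pointIndex, int... dimensions) {
        SortedSet<Integer> subspace = new TreeSet<>();
        for (int d : dimensions) {
            subspace.add(d);
        }
        subspacesByOutlier
                .computeIfAbsent(pointIndex, k -> new LinkedHashSet<>())
                .add(subspace); // the same subspace added twice is stored once
    }

    /** A point is a projected outlier if at least one outlying subspace is recorded for it. */
    boolean isOutlier(int pointIndex) {
        return subspacesByOutlier.containsKey(pointIndex);
    }

    /** Returns the outlying subspaces of a point, or an empty set for a regular point. */
    Set<SortedSet<Integer>> subspacesOf(int pointIndex) {
        Set<SortedSet<Integer>> found = subspacesByOutlier.get(pointIndex);
        return found != null ? found : Collections.<SortedSet<Integer>>emptySet();
    }

    public static void main(String[] args) {
        ProjectedOutlierResult result = new ProjectedOutlierResult();
        result.add(1, 11);   // data #1, subspace {11}
        result.add(2, 1, 6); // data #2, subspace {1, 6}
        System.out.println(result.isOutlier(2));
        System.out.println(result.subspacesOf(2));
        System.out.println(result.isOutlier(3));
    }
}
```

Keeping each subspace as a sorted set matches the convention, described later for the outlier table, that the attributes of an outlying subspace are stored in ascending order.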
The results, denoted by $A$, can be formally expressed as $A = \{\langle o, S \rangle : o \in O \text{ and } S \text{ is the outlying subspace set of } o\}$, where $O$ denotes the set of outliers detected.

The visualization system aims to help users examine the detected outlying subspaces of a high-dimensional data set. Users can adjust the parameters of the outlier detection algorithms and visualize the intermediate detection results. A set of visualization tools is designed to support human exploration in projected outlier subspace analysis.

2.1 System Architecture

The architecture of the visualization system is illustrated in Figure 1. The data to be displayed can include both the original high-dimensional data set and the outlier detection results, after data pre-processing, which includes the standard steps of data cleaning, data transformation and data normalization. Data cleaning removes incorrect records from the data set. Data transformation corrects inconsistent data formats and converts continuous attribute values into a finite set of intervals with minimal loss of information. Data normalization finds the minimum and maximum value of each dimension and rescales the values to the range between 0 and 1.

Figure 1 System Architecture

In the prepared high-dimensional data, one data point may be considered an outlier in many subspaces, so the outlier detection result may be very large. To handle large outlier detection results, the system uses a database to store the data sets and the information on outlying subspaces. After the data preparation stage, the data sets and the outliers are stored in two tables in the database, so that the database server can quickly retrieve the selected data and feed it to the visualization system for display.

With the data sets prepared, the user accesses the system over the Internet with a web browser. The system lets the user select the subspaces and views to display. According to the user's subspace selection, the system connects to the database server through JDBC and sends it queries. The retrieved data and outlier information for the selected subspaces are transmitted to the client machine over the Internet and displayed in the user's web browser. The database and web application services run on the server side. On the client side, the user accesses the web services and visualizes the data and outliers for the selected subspaces in the web browser. The system also lets the user visualize other data sets by reading a data file, specified by name, from the user's local machine. The system is implemented in Java. The client machine needs J2SE 5 and Java 3D 1.5, or higher versions, to run the system.

2.2 Synthetic Datasets

In the experiments, both synthetic and real data sets are used. The synthetic data is generated randomly by the high-dimensional data generator used in the SPOT research [1]. The nature of the data is close to that of real-life data: it exhibits different data characteristics in projections onto different subsets of features. It consists of 15 attributes and 10,000 lines of data. The outlier detection result obtained directly from the SPOT method [1] consists of 426,513 outliers in one-dimensional to three-dimensional subspaces.
Below is a sample of the first two detected outlying subspaces in the file.

    Outlierness Threshold: 3
    *****************************************
    Top outlier: data #1
    In subspace: 11
    Cell index: 1
    Outlier-ness:
    Top outlier: data #2
    In subspace: 1 6
    Cell index: 15 6
    Outlier-ness:

Since the outlier detection result contains only one-, two- and three-dimensional outlying subspaces, the corresponding data table and outlier table are created in the database. The detailed schema of the data table is illustrated in Table 1, and that of the outlier table is given in Table 2. The attribute values of an outlying subspace are sorted in ascending order. For one-dimensional outlying subspaces, the values of dimension2 and dimension3 are NULL. Similarly, for two-dimensional outlying subspaces, dimension3 is NULL. For three-dimensional outlying subspaces, none of the dimension values is NULL.

    Field        Type      Description
    linenumber   int(11)   Primary Key. Row number of data.
    valume1      double    Attribute 1
    valum2       double    Attribute 2
    ...
    valume15     double    Attribute 15

Table 1 Schema of Data in Database

    Field        Type      Description
    id           int(11)   Primary Key; identifies each outlying subspace.
    linenumber   int(11)   Row number of data.
    dimension1   int(11)   Attribute 1 of outlying subspace.
    dimension2   int(11)   Attribute 2 of outlying subspace.
    dimension3   int(11)   Attribute 3 of outlying subspace.
    outlierness  double    Outlierness of outlier.

Table 2 Schema of Outlier Information in Database

2.3 Real-life Datasets

The experiments also include a real-life data set, the KDD Cup 1999 data [10], which is a log of network connection traffic from MIT Lincoln Lab. It contains details of the connections in the network, such as the protocol type, duration, service used and much related information. We use the first 5,000 lines of the labelled "corrected" data for our visualization. In the pre-processing stage, we separate the label information from the data set into a separate file. The label names are transformed into numbers, with each type of network intrusion mapped to one number. Four types of network intrusion (shown in Table 3) are labelled in the first 5,000 lines of data. We use the number of the outlier type as the outlierness value. In this way, we can visualize the distribution of the different kinds of network intrusion.

Table 3 Label Mapping

3 Experiments and System Demonstration

The experiments are developed based on sample cases for both the synthetic data sets and the KDD 1999 network log data. The visualization system can help to answer questions about outlier detection. For example:

1. In a two-dimensional subspace of the synthetic data set, is a selected outlier data point also an outlier in other two-dimensional subspaces?
2. What is the distribution of smurf network attacks in the KDD 1999 data?

Case 1: In a two-dimensional subspace, find out whether a selected outlier data point is also considered an outlier in other two-dimensional subspaces.

To answer this question, we visualize four two-dimensional subspaces (as shown in Figure 2): (Dim 4, Dim 6), (Dim 3, Dim 6), (Dim 12, Dim 10) and (Dim 2, Dim 4). We click one outlier (index #174) in subspace (Dim 4, Dim 6), and then click the Concurrent button in the other two-dimensional subspace display windows. We can easily observe that the outlier data point (index #174) in (Dim 4, Dim 6) is also considered an outlier in (Dim 3, Dim 6) and (Dim 2, Dim 4). Moreover, we may change the outlierness threshold by moving the slide bar in these two windows, and we can read off the outlierness value of the data point (index #174) in both (Dim 3, Dim 6) and (Dim 2, Dim 4).

Case 2: Visualize the distribution of smurf network attacks in the KDD 1999 data.

An example of visualizing the distribution of outliers in three-dimensional subspaces is shown in Figure 3. We find that the smurf network attacks mainly lie close together in the marked area of the selected three-dimensional subspace. Figure 4 is an example of the concurrent display of two-dimensional subspaces. The system reports that the outlier selected in the subspace in the left window is also marked as an outlier in the other subspace in the right window.
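
The check made in Case 1 can also be sketched in code. The Java example below is a minimal illustration under stated assumptions, not the system's actual implementation: the class and method names are hypothetical, the row fields follow the outlier table schema of Table 2, a point is assumed to count as an outlier when its outlierness is at least the threshold, and the outlierness values and the three-dimensional row in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper mirroring the Case 1 check; not the system's own code. */
final class ConcurrentOutlierCheck {

    /** One row shaped like the outlier table of Table 2; unused dimensions are null. */
    static final class OutlierRow {
        final int lineNumber;
        final Integer dimension1;
        final Integer dimension2;
        final Integer dimension3;
        final double outlierness;

        OutlierRow(int lineNumber, Integer dimension1, Integer dimension2,
                   Integer dimension3, double outlierness) {
            this.lineNumber = lineNumber;
            this.dimension1 = dimension1;
            this.dimension2 = dimension2;
            this.dimension3 = dimension3;
            this.outlierness = outlierness;
        }

        /** Two-dimensional subspaces have dimension1 and dimension2 set and dimension3 null. */
        boolean isTwoDimensional() {
            return dimension1 != null && dimension2 != null && dimension3 == null;
        }
    }

    /**
     * Lists the two-dimensional subspaces in which the given data point is an outlier.
     * Assumption: "outlier" means outlierness >= threshold.
     */
    static List<int[]> twoDimSubspacesOf(List<OutlierRow> rows, int lineNumber, double threshold) {
        List<int[]> result = new ArrayList<>();
        for (OutlierRow row : rows) {
            if (row.lineNumber == lineNumber
                    && row.isTwoDimensional()
                    && row.outlierness >= threshold) {
                result.add(new int[] {row.dimension1, row.dimension2});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Subspaces follow Case 1; all outlierness values are made-up placeholders.
        List<OutlierRow> rows = new ArrayList<>();
        rows.add(new OutlierRow(174, 4, 6, null, 5.0));
        rows.add(new OutlierRow(174, 3, 6, null, 4.0));
        rows.add(new OutlierRow(174, 2, 4, null, 4.0));
        rows.add(new OutlierRow(174, 2, 4, 6, 9.0));     // three-dimensional, skipped
        rows.add(new OutlierRow(12, 10, 12, null, 8.0)); // a different data point
        for (int[] subspace : twoDimSubspacesOf(rows, 174, 3.0)) {
            System.out.println("(Dim " + subspace[0] + ", Dim " + subspace[1] + ")");
        }
    }
}
```

Raising the threshold argument plays the role of moving the slide bar: fewer subspaces pass the filter. In the database-backed system the same filter would presumably be a query on the outlier table (matching linenumber, requiring dimension3 to be NULL, and comparing outlierness with the threshold), which the paper does not show.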

Figure 2 Case 1: Two-Dimensional Subspaces Concurrent Display

Figure 3 Case 2: 3D Display

Figure 4 Case 2: 2D Concurrent Display

4 Conclusion and Future Work

The proposed web-based visualization system helps users observe subspaces of high-dimensional data sets interactively. The system enables the user to evaluate the performance of an outlier detection algorithm by visually verifying the correctness of the results, and to determine proper parameter values for better outlier detection results. By visualizing data sets and their labelled results, the user can gain visual insight into the true nature of the data distribution and the outlier distribution. The system is also useful for comparing the effectiveness of different algorithms. The user may adjust the values of different parameters of the algorithms to compare the changes in performance.

The system can currently visualize data sets and their labelled outlier information. It interacts with the user and helps to explore the data sets and outlier subspaces. In future work, we may extend the system so that users can directly label outliers in selected subspaces. Users may also manually adjust the outlierness value of selected outlier data points to observe the sensitivity of the data. Moreover, the system may be integrated with different outlier detection algorithms, such as the SPOT algorithm in [1].

References

[1] J. Zhang, Q. Gao and H. Wang. SPOT: A System for Detecting Projected Outliers from High-dimensional Data Streams. IEEE 24th International Conference on Data Engineering (ICDE 08), Cancun, Mexico.
[2] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press.
[3] Data Visualization system: SequoiaView.
[4] Data Visualization system: Manyeyes.
[5] Drillet Visual Tool for interactive data analysis.
[6] Data Visualization system: GGobi.
[7] Data Visualization system: OpenViz.
[8] Data Visualization system: VisuMap.
[9] Data Visualization system: ADVIZOR.
[10] KDD data source.
[11] B. Aleskerov, E. Freisleben and B. Rao. Cardwatch: A Neural Network Based Database Mining System for Credit Card Fraud Detection. Computational Intelligence for Financial Engineering (CIFEr).
[12] J. F. Costa. Reducing the Impact of Outliers in Ore Reserves Estimation. Mathematical Geology, 35(3).
[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann Publishers.