Correlation discovery from network monitoring data in a big data cluster


Kim Ervasti
University of Helsinki
kim.ervasti@gmail.com

ABSTRACT
Monitoring a telecommunications network is a tedious task. Diverse monitoring data is propagated through the network to different kinds of monitoring software solutions. Processing rules are needed to make use of this data. As entering the rules manually is not an option at the size of today's telecommunications networks, automatic solutions that include data mining are needed. Existing solutions for mining network monitoring data suffer from the performance issues of computationally demanding algorithms. A scalable distributed solution is needed. This paper surveys existing studies on distributed solutions for data mining and processing network monitoring data. We also summarize and evaluate the experimental results of these studies and propose directions for future research.

Categories and Subject Descriptors
D.4.7 [Organization and Design]: Distributed Systems

Keywords
Big data, network monitoring, event correlation, decentralized data mining, frequent item set, frequent patterns, association rules, MapReduce, Apache Hadoop

1. INTRODUCTION
Telephone companies and telecommunications operators have a critical need to monitor their systems: networks and services. Poor quality or even downtime of a service causes loss of money and profits to both the operator and its customers. A telephone company offers a broad catalogue of services. For example, a telephone company could serve mobile phone networks, broadband networks, core networks, hosting services, web services and, nowadays, even value-adding services such as Internet movie rentals, home security services, etc. These services are provided on top of a mixture of technologies and networks. The network equipment rooms and server rooms, the latter often referred to as colocation centres, must also be monitored so that the equipment is hosted safely.
Network monitoring consists of monitoring data, measurements and event messages from the network equipment. To understand the causes and causalities behind the received data, the data must be processed. Processing the network monitoring data is referred to as correlating. To correlate the monitoring data, one must determine what data is important, which alarms are related to each other, and which alarm indicates the real cause of a network issue. These correlation rules are either defined by network experts or generated automatically after being discovered by data mining the monitoring data.

Telecommunication networks are continuously growing both in size and complexity. Today there is an urgent need for data mining solutions that automate the discovery of effective correlation rules. Countless data mining solutions exist in this field, but most of them do not scale to the rapidly growing amount of data they should be able to process. This paper surveys the existing studies on scaling up data mining of network monitoring data to big data scale.

The paper is organized as follows: Section 2 presents a quick overview of the architecture behind network monitoring. Section 3 demonstrates how monitoring data is used to detect faults in the network and why this is non-trivial. Section 4 then presents how this problem can be solved and improved by using data mining. Section 5 presents the potential issues in scaling up the data mining of network monitoring data. Section 6 surveys the existing studies, evaluates their results and makes further proposals. Section 7 concludes the paper.

2. NETWORK MONITORING ARCHITECTURE
Telecommunications operators must monitor their services and networks to assure the functionality of the network and the availability of the services provided.
All of these networks, consisting of different technologies, equipment rooms, server rooms, servers, and the services on top of them, are monitored by a wide spectrum of complex monitoring software solutions. Each monitoring software solution is specialised to monitor a specific part of the network: the monitoring protocols and methods vary depending on what is being monitored. The variety is so great that a common monitoring software solution to monitor everything is practically impossible to make. Equipment manufacturers tend to make the monitoring software specialised to their equipment in a closed-source manner. In addition to monitoring, these software products also work as the network element managers required to operate the equipment. Equipment licenses, if nothing else, might prevent using other software for management as well.

[Figure 1: Network monitoring architecture. Labels: Management, Events, Notifications, Automations, Ticketing, Network monitoring system, Network devices, Radio access network, Servers, Equipment rooms]

Figure 1 illustrates a simplified hierarchy of network monitoring. At the bottom is the monitored equipment, referred to as network nodes. On the upper level are the element managers, or equipment-specific monitoring software solutions, that monitor and manage these specific technologies. They are used by network operators, who are experts on the specific technology in question. These monitoring systems in turn forward the monitoring data to a collective network monitoring system that can process the data on a higher level by combining data and discovering correlations between events from different sources. By processing the data, important network events are detected, and commonly three options exist for making use of the discovered information: 1) customer notifications can be generated to notify customers of the situation, 2) automations can be executed to try to fix the issue automatically, for example by resetting the equipment, and 3) a service ticket requesting repairs can be generated for the subcontractor responsible for the equipment. These options are not mutually exclusive.

There are different ways to monitor the network. Network equipment, equipment rooms, servers and services are most often monitored from three aspects:
1. by collecting continuous metric measurement data, also known as performance data,
2. by collecting somewhat continuous end-to-end (E2E) testing data, and
3. by receiving discrete event messages, referred to as events, from the network nodes.
Monitoring software solutions are usually specialised in only one of these ways of monitoring the network.

Performance data is gathered by collecting numeric data such as equipment sensor readings, network router transfer rates, the number of point-to-point connections (telephone calls), the number of Short Message Service (SMS) messages (also known as text messages) sent per hour, etc. Performance data can be transformed into discrete event data. This is called discretization. Discretization is done by defining threshold values that cause an event to be generated when the measurement value exceeds them. For example, if the temperature measurement of a CPU rises above 70 °C, an alarm with minor severity is created. If the temperature rises further above 80 °C, another event with major severity is created. If the temperature falls back to normal, say below 50 °C, an acknowledging event with normal severity is created.

End-to-end data consists of the results of various end-to-end tests or measurements. A few examples: 1) how fast is a server or a service responding from end to end, 2) how fast can 1024 kilobytes be transferred through a 3G mobile network from device to device, and 3) is a specific service responding? The results of these end-to-end tests are also often transformed into events, for example when the test fails (the tested service is not responding) or the measurement result crosses a threshold (the transfer rate is too slow).
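The threshold-based discretization described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Event` type, the `discretize` function and the `(timestamp, value)` sample format are hypothetical names introduced here, and the thresholds are the 70/80/50 °C values from the CPU example. One behaviour is an assumption not stated in the text: once a major alarm is raised, it is kept until the value falls below the clearing threshold.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: int
    severity: str  # "minor", "major" or "normal"
    value: float


def discretize(samples: Iterable[Tuple[int, float]],
               minor: float = 70.0,
               major: float = 80.0,
               clear: float = 50.0) -> Iterator[Event]:
    """Turn a measurement series into events, one per severity change."""
    state = "normal"
    for timestamp, value in samples:
        if value > major:
            new_state = "major"
        elif value > minor:
            # Assumption: an active major alarm is not downgraded to minor.
            new_state = "major" if state == "major" else "minor"
        elif value < clear:
            new_state = "normal"
        else:
            # Between the clearing and the minor threshold: keep the state.
            new_state = state
        if new_state != state:
            state = new_state
            yield Event(timestamp, state, value)


if __name__ == "__main__":
    readings = [(0, 45.0), (1, 72.0), (2, 75.0), (3, 85.0),
                (4, 78.0), (5, 60.0), (6, 49.0)]
    for event in discretize(readings):
        print(event)
```

Keeping the alarm state between the clearing threshold and the alarm thresholds avoids a stream of repeated events when a measurement hovers around a single limit.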

Event data consists of event messages that equipment hardware, or a monitoring agent on a server, sends to the monitoring system about an event detected on the device. Examples of network events are a device startup, a network interface link going down, or a measurement value rising above a predefined threshold.

3. NETWORK FAULT ISOLATION
Network monitoring is mainly about collecting two kinds of equipment state information from the network: 1) state changes as events, and 2) metric information as measurements, as described earlier. Network monitoring systems do not magically get information about possible problems in the network [4], because broken equipment might not be able to send alarm events to the monitoring systems, or degraded equipment might be unable to detect the problem by itself. By gathering measurement and status information, a network monitoring system can collect enough known symptoms to detect the fault. In other words, when there is a problem in the network or in a service, the monitoring data usually contains symptoms of the fault.

A simple example is a broken hard drive in a server. The server sends an alarm to the monitoring system saying that the hard drive is unreadable. This event is a very clear symptom of the fault that has occurred: a broken hard drive. However, it is not always this trivial. Consider the following scenario of a broken network router. The router broke without warning, so it did not manage to send any alarm event before the fault occurred, and now it is unable to handle any network traffic. The monitoring data does, however, contain alarm events from other routers in the network saying that the network links to this specific router are all degraded or down. From these symptoms we can conclude that the router has in fact failed. A fault can have different kinds of symptoms depending on the case.
Very often multiple different symptoms are observed as a result of a single fault in the network. Event messages are produced based on those symptoms. It is not rare for a single fault in the network to produce a storm of events. The underlying real reason for the fault, the root cause, is not always obvious even when a fault in the network has been detected. In other words, the fault in the network is the problem produced by a certain root cause. In the router example, the fault is lost or degraded network connectivity around the broken router, and the broken router equipment is the root cause of this fault. The symptoms of the fault are slow or zero transfer speeds or high packet loss on the degraded network link.

Another, more complex but very concrete correlation example is illustrated in figure 2. The figure shows a connectivity fault in the Multiprotocol Label Switching (MPLS) core network between two Dense Wavelength Division Multiplexing (DWDM) transponders. As a result, connectivity to the Provider Edge (PE) router is lost, a Digital Subscriber Line Access Multiplexer (DSLAM) using that router has lost its connectivity, and all the equipment using that DSLAM for connectivity is now out of service: a small 3G base station site and several corporate networks. All three of these types of equipment, the DSLAM, the 3G site and the corporate networks, are under active monitoring by an element manager or monitoring software, so it is quickly discovered that the connection to those services is lost. Alarms are generated from all of these systems individually. The monitoring of the MPLS core network also creates an alarm about the lost link to one of the DWDM transponders. Although a total of four alarms are generated, only one of them directly indicates the device needing repairs.
It is, however, non-trivial to discover that these four alarms are actually related to each other when they are received in a central network monitoring system, as seen in figure 1. Finding the root cause of the fault requires root cause analysis based on the available monitoring data about the symptoms. The monitoring data must be combined and correlated. By correlating the network monitoring data we can find which events and measurement data are related to a certain problem and which are not. By combining this problem-related data, we can further track down the root cause of the problem.

For example, if we observe that we have lost the connection to exactly all the base stations under a certain Base Station Controller (BSC), we can conclude that the problem is not in any of the base stations individually, but in either the BSC or the transmission link to the BSC. Transmission links provide the transfer of phone calls, text messages and mobile data between the BSC and the operator, as well as the management connection to the BSC. As we do not observe any alarm event for the specific transmission link that provides connectivity to this BSC, we can conclude that the problem is a broken BSC.

To discover the root cause this way, the correlation between the alarms must be discovered first. That is, which received alarms or measurements are associated with each other, and which ones are not. For example, if we receive two alarms from two different base stations, it is not automatically known whether those alarms have a real correlation to each other or not. With data mining approaches, correlation information can be discovered as event correlation patterns [20].

4. DATA MINING OF THE ALARM DATA
To identify the root cause of a network fault, the alarms related to the fault must be discovered. This can be done if the network topology is known: a faulted device obviously affects the devices topologically related to it.
Unfortunately, in a big telephone company the network topology is rarely known in a format that could be used in root cause analysis. This is especially true if the alarms related to the fault are emitted from several different types of networks. Multiple topologies of different types of networks would be needed, and the connections between them would have to be known. Services served on top of a physical network, or often a combination of different types of networks, are rarely mapped to the underlying network as a combined topology. As services are often mapped to logical hosts and connections instead of physical ones, linking them all together into a single combined topology is a complex task. If the topology is not available, the effective relations among network equipment must be data mined. Because the networks of telecommunications companies are huge, this approach requires a lot of learning data collected over a lengthy time period. This makes the data very large and the data mining therefore computationally very time consuming.

[Figure 2: Fault impact in the network. Labels: MPLS Core, DWDM, Fault, DWDM, Provider Edge router, DSLAM (Alarm), 3G site (Alarm), Corporate networks (Alarm)]

Discovered patterns of network events and measurements can also be used for pre-emptive monitoring of the network [18]. Proactive network management in the form of failure prediction is a hot topic in today's telephone companies. Low severity alarms, warnings and information about routine state changes are continuously received from network devices. They rarely indicate a real problem occurring in the near future. Which low severity alarms indicate an upcoming problem, and with what probability, is not known a priori. Unfortunately, even the equipment manufacturers themselves usually do not know this. That is why the trend has been to discard these alarms altogether. Reacting to every one of them is not possible in a network of telephone company scale. A low severity alarm or a group of alarms predicting a real problem is a statistic that can be discovered by mining monitoring data, if enough learning data is present. Ren et al. [14] put it a bit differently: if we can find a statistical rule that identifies a correlation between warning events and a fatal event, the occurrence of those warning events indicates an elevated probability of the fatal event occurring in the near future.

Figure 3 illustrates in graphical form two different but related predictive patterns discovered from event data. In the example, three different events about symptoms 1, 2 and 3, received in this order, are known to precede a network fault with high probability. Two events about symptoms 1 and 2, in this order, are also known to precede the same fault, but with lower probability, as shown by edge D in the figure. Symptom 1 alone does not seem to indicate any fault. In general, patterns of low severity events leading to severe fault situations do exist in the network monitoring data.
When combined with network measurement data, even more useful patterns exist. As the patterns exist, they can also be discovered with the right tools. Discovering these hidden patterns in the data is the basis of fault prediction. Because of the massive amount of data, data mining approaches must be applied to find them instead of human work.

[Figure 3: Pre-emptive pattern. Labels: Symptom 1, Symptom 2, Symptom 3, Problem; edges A, B, C, D; axis: Time]

In terms of data mining, alarm correlations are regarded as alarm association rules [20]. Association rules are discovered by searching for frequent item sets in the data. The frequent item sets found can then be transformed into association rules. Frequent item sets are sets of one or more items that occur frequently in the data. The occurrence frequency of an item set is called the support value of the item set. The support value is defined as the number of transactions that contain the item set divided by the number of all transactions. The threshold for an item set being frequent is a user-defined value called the minimum support value.

Frequent item sets are often mined from transactional data. An easy and typical real-life example of transactional data is market purchase data: collections of shopping items bought together by customers. In this example, item sets represent sets of merchandise that are often bought together. The association rules express causalities within frequent item sets. For example, if bread and butter are frequently bought together, and can therefore be discovered as a frequent item set, an association rule saying that customers who buy bread often also buy butter could be discovered.

Discovering association rules is very useful for finding predictive patterns in the network monitoring data. Figure 3 could represent the following association rule: if symptoms 1, 2 and 3 are detected in the network, the Problem will occur. A confidence value is attached to each association rule.
The value represents the probability that the rule holds. The confidence of a rule is rarely 100%. For example, some people might want to eat their bread without butter, so they buy bread without it. Therefore the rule could have a confidence of 80%. In figure 3, the rule represented by edge D, saying that if symptoms 1 and 2 occur, the Problem will occur, has a lower confidence than the rule that requires all the symptoms to be detected first, edge C. Association rules are generated from the discovered frequent item sets.

In addition to frequent item set discovery, rare events, like catastrophic failures, which obviously do not occur frequently, can also be of interest. In that case, rare item sets are searched for instead [18, 19, 17, 15]. However, the rare target event must be known and specified. Data mining is then applied to discover the predictive pattern leading to it.

After performing either rare or frequent item set mining and transforming the item sets into association rules, a confidence can be calculated for each rule. After such patterns and rules have been discovered, high-confidence rules can be selected and searched for in the new incoming alarm data stream in real time. When such a pattern is detected, a new high severity alarm with additional information about the root cause can be generated in advance, before the fault has actually occurred. Depending on whether the fault can actually be prevented or not, either pre-emptive actions or preparations for repairs can be made.

5. CHALLENGES IN DATA MINING THE NETWORK MONITORING DATA
The Apriori algorithm is very commonly proposed and used for mining frequent item sets and discovering association rules from transactional data [1]. Apriori works as follows. First, it finds all the single items in the data that are frequent, in other words, whose support value is greater than the predefined minimum support value. Second, it generates pairs from all the frequent items discovered in step one. Third, the occurrences of these candidate pairs in the data are counted. The candidates with a support value lower than the minimum support are pruned. After this step, the frequent item sets of two items have been discovered. Fourth, item sets of three items are generated from the frequent pairs and items as new candidate item sets of size three, and then counted and pruned in the same way as the candidate pairs. This procedure is continued by increasing the size of the item set until no item set reaches the minimum support value. All the item sets found to reach the minimum support value during the run of the algorithm are frequent item sets.
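The level-wise procedure described above can be illustrated with a small, self-contained Python sketch. It is a simplified illustration, not the implementation of [1] or of any surveyed study: the function names `apriori` and `confidence` and the market basket data are made up for this example, and an item set is treated as frequent when its support is greater than or equal to the minimum support, which is an assumption about how the boundary case is handled.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item of the set.
        return sum(itemset <= t for t in transactions) / total

    # Step 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    size = 2
    while current:
        # Join frequent sets of the previous size into candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == size}
        # Prune candidates that have an infrequent subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        # Count the remaining candidates and keep the frequent ones.
        current = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        size += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


if __name__ == "__main__":
    baskets = [{"bread", "butter", "milk"}, {"bread", "butter"},
               {"bread", "milk"}, {"bread", "butter", "jam"}, {"milk"}]
    found = apriori(baskets, min_support=0.3)
    for itemset, s in sorted(found.items(), key=lambda kv: -kv[1]):
        print(sorted(itemset), round(s, 2))
    print("bread -> butter:", round(confidence(found, {"bread"}, {"butter"}), 2))
```

Each pass over `transactions` in `support` corresponds to one scan of the data, which is why the number of candidates and the number of scans dominate the running time on large data sets.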
The lengths of the item sets vary, but the search can of course be restricted to sets of a specific length. The classic Apriori algorithm suffers from an exponential search space when mining frequent item sets [20], and it needs to scan through the data multiple times.

Telecommunications networks are often huge and ever-growing, not just in size but also in complexity and in the variety of technologies used. The amount of measurement data in telecommunications network monitoring is on the terabyte scale daily. Event data is smaller than measurement data, hundreds of megabytes or gigabytes per day, but as the discovery of frequent sets is computationally expensive [20], even this amount of data introduces challenges for centralized data mining. Measurement and event data combined easily bring the data to big data scale. Therefore, to find the patterns in reasonable time, the data mining must be scalable [22].

The Apriori algorithm is incapable of discovering the temporal relationships among alarms [20]. The temporality and timing order of alarms have a central role in correlations [20, 14]. Thus the Apriori algorithm must be modified to work in parallel on a distributed platform and to take temporality into account.

6. SURVEY OF PROPOSED SOLUTIONS
Distributing Apriori with MapReduce [2, 6] has been widely studied, and several implementations have been proposed. MapReduce is a programming model and implementation for processing massive data sets in parallel in a distributed fashion. Users define only the map and reduce functions, and the underlying framework automatically handles the distribution of the computation across the cluster, hardware failures and load balancing. Important groups of contributors on MapReduce-distributed Apriori include [22, 8, 7, 23], to name a few. However, only very few authors have also addressed the problem in the context of network event correlation, so the algorithms lack the ability to take temporality into account.
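To make the division of work between the map and reduce functions concrete, the following Python sketch simulates the model in a single process and uses it to count the support of candidate item sets. It is only an illustration under stated assumptions: `map_candidates`, `reduce_support` and `run_job` are hypothetical names, `run_job` merely stands in for what a real framework does across a cluster, and the map step enumerates every subset of a given size instead of using Apriori's candidate generation.

```python
from collections import defaultdict
from itertools import combinations


def map_candidates(transaction, size):
    """Map: emit (candidate item set, 1) for each subset of one transaction."""
    for candidate in combinations(sorted(transaction), size):
        yield candidate, 1


def reduce_support(candidate, counts, min_count):
    """Reduce: sum the counts of one candidate, emit it only if frequent."""
    total = sum(counts)
    if total >= min_count:
        yield candidate, total


def run_job(partitions, size, min_count):
    """Stand-in for the framework: map each partition, shuffle, reduce."""
    shuffled = defaultdict(list)
    for partition in partitions:  # each partition would live on one worker
        for transaction in partition:
            for key, value in map_candidates(transaction, size):
                shuffled[key].append(value)  # group the values by key
    result = {}
    for key, values in shuffled.items():
        for candidate, total in reduce_support(key, values, min_count):
            result[candidate] = total
    return result


if __name__ == "__main__":
    # Two partitions of alarm transactions; the alarm names are made up.
    partitions = [[{"A", "B", "C"}, {"A", "B"}],
                  [{"A", "C"}, {"A", "B", "D"}]]
    print(run_job(partitions, size=2, min_count=2))
```

Because each map call sees only one transaction and each reduce call sees only one key, both steps can run on different workers without shared state; only the shuffle moves data between them.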
Other big-data-scale data mining solutions exist for discovering patterns in, for example, market purchase data, medical data and financial data, but these solutions also lack the temporal, timing order and data dimensionality capabilities that network monitoring data requires [14]. As described above, network events have an important temporal relationship, which also has to be taken into account. This rules out making use of the considerable number of studies on big data solutions for mining market purchase data, medical data and financial data.

6.1 Study 1: Distributed Apriori
Wu et al. [20] proposed the following interesting features in their study:
1. a MapReduce-Apriori algorithm, a MapReduce-based implementation of Apriori that discovers frequent item sets in parallel,
2. translation of network events into transaction form that also takes the temporality of events into account,
3. a decentralized database solution to make parallel data mining possible, and
4. an approach to filter irrelevant events locally to reduce the total amount of alarm data and further improve performance.

The problem with mining network monitoring data in parallel lies in distributing the alarm data. As the correlations among alarms are not known a priori, there is no way of knowing which alarms are needed at which mining node either. For this, Wu et al. proposed a file system shared between the nodes, the Global Distributed File System (GDFS). Each node uses its local storage as a cache and can upload and download files to and from the GDFS.

Wu et al. [20] processed all monitoring data into events. Discretizing the measurement data relies on predefined threshold values, as described earlier.
Because the thresholds are defined by an expert, there is a risk of losing interesting anomalies that might carry relevant information for patterns when combined with other event data, for example if the thresholds are set too high or too low because the behaviour of the equipment has been misunderstood. On the other hand, discretizing the measurement data makes the data much easier to use for data mining and also radically reduces the size of the data.

Wu et al. [20] transformed the collected events into transactions. Their monitoring environment processes events only in periods of five minutes. This simplifies dividing the alarms into transactions, but in real-world use it also introduces unwanted bias into the temporal information of the events. An example of alarms divided into transactions is shown in table 1. In this simplified example only four different nodes, numbered from 1 to 4, have sent events, and the events have only one attribute: an alarm identification number. Alarms are divided into 5-minute periods.

TID  Time slot  Transaction
1    00:00      Node 1: event 1, Node 2: event 2, Node 2: event 3, Node 3: event [?]
2    00:05      Node 1: event 1, Node 2: event 2, Node 2: event [?]
3    00:10      Node 1: event 1, Node 2: event 3, Node 3: event 2, Node 4: event [?]
4    00:15      Node 2: event 3, Node 3: event 2

Table 1: Example of events transformed into transactions

Transactions are then handled as item sets. The goal is to find correlations among these items in the form of association rules. An Apriori-based iterative process is proposed to tackle this task. Wu et al. divided the association rule mining process into two phases:
1. find frequent item sets whose support count is greater than or equal to a predefined minimum support value
2. generate association rules whose confidence is greater than or equal to a predefined minimum confidence value

Both of these steps are implemented using MapReduce to parallelize the computations. In frequent item set mining, the candidate sets are generated in the map function, and the reduce function counts the frequency of the candidate given to it and outputs the candidate set only if it exceeds the support threshold. To generate the association rules, all subsets of every frequent item set must be discovered. To parallelize rule discovery, subsets are generated in the map function, and the reduce function computes the confidence of the rule given to it and outputs the rule only if the confidence is greater than the confidence threshold. The algorithms are presented in detail in [20].

Wu et al. [20] implemented their algorithm on top of the Apache Hadoop framework [3], a widely used open-source implementation of the MapReduce architecture. They built a synthetic distributed network that mimicked a network environment and generated events. The simulated network elements were related to each other, so the events had correlations to each other as well. To demonstrate the effectiveness of their algorithm, they applied both the distributed MapReduce-Apriori and the classical centralized Apriori to the data and compared the execution times. In the distributed implementation they used two computers similar to the one used for centralized mining. They tested the algorithms with data sets of [...], [...], [...] and [...] events. The execution times increased exponentially as the amount of data increased to [...] events. However, although one might expect overhead from the distributed solution, execution times were almost halved with every data set when MapReduce-Apriori was used instead of classical Apriori. The implementation Wu et al. proposed shows potential. However, test results with more than just two worker nodes executing the data mining would be very interesting to see.

6.2 Study 2: Overhead of MapReduce
Another group of authors, Reguieg et al. [12] (a more extensive report is also available [13]), studied event correlation in a somewhat different context than network alarm data. They used the extensive work of Motahari-Nezhad et al. [10] as the basis of their study. Motahari-Nezhad et al. analysed web service interaction log files to identify correlations among events that belong to the same process execution instance. Discovering patterns from an event log in this way is nevertheless somewhat similar to discovering event patterns from system and service log files in a telecommunication service context. Motahari-Nezhad et al. structured the log files into messages, each containing multiple attributes.
Correlated messages are identified by using correlation conditions. In their model, two types of conditions exist: 1) atomic conditions and 2) conjunctive conditions. An atomic condition is met when two attributes of two messages are equal to each other. A conjunctive condition is a conjunction of several atomic conditions. They did not use the Apriori algorithm but developed their own algorithm for finding the conditions, generating candidates and pruning them.

Reguieg et al. [12] introduced a MapReduce-based approach to discover these correlation conditions in a parallel fashion. They partitioned the logs across the cluster nodes to balance the workload. The implementation had two stages: 1) map and reduce to compute the atomic correlation conditions and their associated process instances, and 2) map and reduce the results level-wise to compute the conjunctive conditions per process instance.

Reguieg et al. [12] tested their implementation with both real-world datasets and randomly generated data with different log sizes. The experiment results can be seen in [13]. Their focus was on observing the overhead of using the MapReduce framework. They executed the tests using the Apache Hadoop implementation running on five virtual nodes, one master and four workers. Their results show that the overhead of MapReduce is small compared to the global gain in performance and scalability. They made two main observations from the results: 1) extending the proposed algorithm with additional map-reduce steps is potentially interesting, as it would increase the level of parallelization and hence improve scalability without too negative an impact on global performance, and 2) the implementation of stage two, discovering conjunctive conditions, can be improved by using compact data structures and Apriori-like algorithms. They pointed out the need for future research into different data partitioning techniques and other kinds of implementations that may increase overhead while further improving the scalability of the proposed approach.

6.3 Study 3: Architecture of distributed event correlation discovery
A third group of authors, Holtz et al. [6], studied a similar kind of problem in the context of intrusion detection. In network fault management, event logs are gathered from nodes in the network (network devices, servers, service logs and network element management software logs) to discover faults. Similarly, in intrusion detection, network traffic, operating system logs and general application data are collected from various sensors around the network (network devices, servers and user workstations) to identify intrusions. The data collected from the different sources is aggregated, processed and compared for event correlation analysis [6], much as in network fault management. In intrusion detection, event correlations are discovered in order to find patterns for attack signatures and malicious activities.

As their main contribution, Holtz et al. proposed an architecture for a Distributed Intrusion Detection System (DIDS). The system has the following features:
1. Distributed data collection from multiple sources
2. Scalable storage in a distributed file system infrastructure
3. Scalable distributed event correlation discovery in a big data cluster
4. Implementation with open source software tools

The architecture was designed to use the MapReduce framework. Specifically, they used the Hadoop implementation of the MapReduce framework for data processing and the Hadoop Distributed File System (HDFS) [16, 6] for scalable storage of the network data. The proposed architecture is illustrated in figure 4. The HIDS and NIDS sensors in the figure are Host Intrusion Detection Systems (HIDS) and Network Intrusion Detection Systems (NIDS), the former being a sensor for an individual host and the latter a sensor listening to the network in promiscuous mode. The DIDS architecture proposed by Holtz et al. also provides Web visualization of the results, as seen in the figure.

[Figure 4: Distributed Intrusion Detection System architecture [6]]

Holtz et al. [6] introduced no algorithms in their paper, just the architecture, and then tested the data storage and processing performance of the cluster. The cluster was composed of one master node and five worker nodes. The first experiment tested file system performance and consisted of creating random files of varying sizes. File sizes ranged from 1 megabyte to 250 megabytes. The experiment results show that write times increase linearly with increasing file size. Holtz et al. claimed this makes it clear that the system would scale to a large quantity of data.

The second experiment tested computational performance by sorting the random data used in the first experiment. It was noted that the entropy of random files is high, so sorting the content was claimed to be a worst-case scenario. The results show that the sorting time remained constant regardless of the file size, oscillating between 33.4 and 35.6 seconds for all file sizes. The oscillation was attributed to natural variation in network performance. The reason for the constant sort time is the constant overhead of initializing the Hadoop cluster when preparing the worker nodes for the task. The time spent in the sorting task itself is insignificant compared to the synchronization overhead, which is claimed to indicate that the solution scales nicely to different data volumes.

The introduced Hadoop-based architecture shows great potential for both storing and processing data in amounts that scale to even massive magnitude if needed. Though the proposed architecture was designed for intrusion detection purposes in the paper, the same qualities are needed in any correlation discovery environment, which makes it a significant achievement in the field.

6.4 Summary and evaluation of the solutions
Three environments were introduced in the studies presented.
All of them used MapReduce to distribute the load, and the studies found the approach feasible. The greatest difference between the studies was the file system. The first study shared the whole file system among all nodes, so all data was accessible to every node. The second study partitioned the data among the nodes. The third used HDFS to distribute the data among the nodes. A partitioned or distributed model is more scalable than a centralized one, but the right solution depends on the needs of the system, that is, on what kinds of rules are to be discovered from the data.

Reguieg et al. [12] introduced a simple algorithm for data processing, and partitioned the data among the nodes for efficient parallel computation. But as they noted in their paper, partitioning the data was their biggest difficulty. Partitioning the data is problematic especially if the task is to discover complex correlations among all different types of technologies, in other words, among all the alarms instead of a certain group of alarms. The easiest, but most brute-force, solution to this problem is to have all alarm data accessible to all nodes. The alarms can then be arbitrarily divided among the nodes and assigned to them accordingly. The nodes then use the database of all the alarms to search for alarms that correlate with the assigned alarms. Because the correlations are not known a priori, all alarms must be accessible as candidates for correlating with the assigned alarm. This database can either be centralized and then cached among the nodes, as Wu et al. proposed, or stored in a distributed file system which is simply accessible to all nodes, as Holtz et al. [6] proposed. The centralized solution, however, might be a bottleneck when scaling the system, and it also fails to provide the same level of fault tolerance as a large-scale distributed file system like HDFS. On the other hand, HDFS introduces a small overhead and additional network traffic among the nodes. If the collected monitoring data fits into the local cache of each node, it is probably best to use the cached solution instead.

To discover simple local correlations, however, partitioned data is often a better solution, as it is most likely faster. The data can be partitioned per node, as Reguieg et al. [12] proposed, or it can be partitioned into bigger slices that are temporally close or local to the nodes most likely to need that data, in other words, by using the temporal locality principle in partitioning the data. For example, when discovering frequent patterns from monitoring data consisting of events collected over a period of two months, the data can be partitioned temporally, because it is very unlikely that the alarms at the beginning and at the end of the data have any correlation with each other. For accurate predictive patterns, the correlations must be discovered among events that are only a few hours or at most a few days apart [18]. So, in this case, the data could be partitioned into 4-day segments, for example, giving a total of 15 partitions of data. In addition to temporal partitioning, monitoring data can be partitioned geographically into sectors: alarms coming from sources 100 kilometers away from each other most likely do not correlate with each other.
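The temporal partitioning example above (two months of data in 4-day segments) can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed studies. The function name `partition_by_time`, the alarm names and the six-hour event spacing are hypothetical, and events are assumed to be simple (timestamp, alarm) pairs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def partition_by_time(events, segment=timedelta(days=4)):
    """Group (timestamp, alarm) pairs into consecutive fixed-length segments.

    Segment 0 starts at the earliest timestamp found in the data.
    """
    if not events:
        return {}
    start = min(ts for ts, _ in events)
    partitions = defaultdict(list)
    for ts, alarm in events:
        # Dividing one timedelta by another with // yields an integer index.
        index = (ts - start) // segment
        partitions[index].append((ts, alarm))
    return dict(partitions)


# Hypothetical data: one alarm every 6 hours for 60 days (about two months).
t0 = datetime(2015, 1, 1)
events = [(t0 + timedelta(hours=6 * i), f"alarm-{i % 5}") for i in range(60 * 4)]

partitions = partition_by_time(events)
print(len(partitions))  # 60 days / 4 days per segment = 15
```

In this sketch, events close to a segment boundary end up in different partitions, so correlations across a boundary would be missed. A practical version would probably need segments that overlap by at least the longest time span within which correlations are expected.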
The sectors should of course overlap, to avoid losing correlations at the edges of the sectors. The problem with geographical partitioning is that the equipment does not have information about its geographical location [4], so this information must be provided with an additional equipment database.

The first study, from Wu et al. [20], used the Apriori algorithm, and the second study, from Reguieg et al., also recommended improving their approach with it. Apriori is a common algorithm for frequent item set mining, and it has already been widely proven to work for mining network monitoring data in many centralized solutions, [1, 9, 5] to name a few, with the only downside of being computationally intensive. This downside is to some extent mitigated by the presented distributed solutions. Another dilemma was the mandatory step of transforming the network monitoring data to transactional form while taking temporality into account. Wu et al. proposed a simple yet working solution, introduced earlier in this paper, and more accurate solutions that decrease the temporal bias of the implementation are available, for example [21].

It is clear that more work is needed in this field. First, studies with centralized data mining solutions in this field claim that the Apriori algorithm is far from being the only algorithm for mining network monitoring data. Studies introducing other parallel data mining algorithms that are not based on Apriori, and that fulfill the requirements of mining network monitoring data, are needed. They have the potential to show better scalability and lower computational requirements. Second, all the existing studies about distributing the data mining of network monitoring data, and the like, are based on the MapReduce framework. While other frameworks have been comprehensively studied in other contexts, distributing the data mining of network monitoring data introduces problems that are somewhat unique to this context.
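To make the MapReduce-based counting that the surveyed studies share more concrete, the following plain Python sketch imitates a single support-counting pass over partitioned transactions. It is an illustration under assumptions, not the implementation of any surveyed study. The function names and the alarm names are hypothetical, the map step simply enumerates all k-item subsets instead of generating candidates from frequent (k-1)-item sets as Apriori does, and the partitions are processed sequentially here, whereas a cluster would process them on separate worker nodes.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def map_count(partition, k):
    """Map step: count every k-item subset of each transaction in one partition."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step: merge the partial counts of two partitions."""
    return a + b


def frequent_itemsets(partitions, k, min_support):
    """Return the k-item sets whose total count reaches min_support."""
    # In a real cluster each map_count call would run on its own worker node.
    partial = [map_count(p, k) for p in partitions]
    total = reduce(reduce_counts, partial, Counter())
    return {itemset: n for itemset, n in total.items() if n >= min_support}


# Hypothetical alarm transactions, already split into two partitions.
partitions = [
    [{"link_down", "bgp_flap"}, {"link_down", "bgp_flap", "high_temp"}],
    [{"link_down", "bgp_flap"}, {"high_temp", "fan_fail"}],
]

print(frequent_itemsets(partitions, k=2, min_support=2))
# {('bgp_flap', 'link_down'): 3}
```

The same map and reduce functions could in principle be expressed in another framework, since only the way the partial counts are produced and merged depends on the framework.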
Further studies based on other frameworks, such as Spark, are needed. However, there is a very recent study [11] on distributing the Apriori algorithm with the Spark framework. As it is clear that network monitoring data can be transformed to transactional data while taking temporality into account, and then mined with Apriori with minor modifications to the algorithm, the proposed Spark solution is most likely transferable to this context as well. Third, there exist no recent public studies that use real-life, big data scale telephone company data for experiments. In many countries, laws and regulatory requirements strictly prevent telecommunications customer data from being 1) used in a way that is not related to the tasks required for managing the network and services, and 2) handled by persons who are not required for operating the networks and services. In other words, the network data is very confidential in nature. Telephone companies and big corporations do study solutions, but these commercial solutions are classified as company secrets because of the strategic advantage they give in the market, and are therefore not open to public research.

7. CONCLUSION

This paper surveyed and then evaluated existing solutions for mining interesting patterns from telecommunications network monitoring data in a distributed fashion. The paper can be summarized in five steps. First, it was explained that the ability to mine the network monitoring data is important for telecommunications companies in order to assure the functionality of the provided services. Second, it was further discussed that the main problem of the data mining is the high computational requirements, caused by two factors: 1) the amount of monitoring data produced on a daily basis is huge, and 2) monitoring data is needed over a lengthy period of time to provide enough learning data to cover a wide network.
Third, existing studies on distributing the data mining to scalable solutions were surveyed and three important aspects were discovered: 1) a modified Apriori algorithm that supports temporal data can be distributed with MapReduce, 2) the overhead is relatively small, and 3) the data can be either centralized and cached to the worker nodes, partitioned among the worker nodes, or distributed between the worker nodes, depending on the case. Fourth, the studies were evaluated, and all of them were found to show that the proposed distributed solutions work. Finally, fifth, the need for future work was addressed: algorithms other than Apriori and frameworks other than MapReduce should be studied in the context of data mining network monitoring data.

8. REFERENCES

[1] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pages ,
[2] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM,
51(1): , Jan
[3] A. S. Foundation. Apache hadoop project
[4] R. Gardner and D. Harle. Fault resolution and alarm correlation in high-speed networks using database mining techniques. In Information, Communications and Signal Processing, ICICS., Proceedings of 1997 International Conference on, volume 3, pages vol.3, sep
[5] K. Hatonen, M. Klemettinen, H. Mannila, P. Ronkainen, and H. Toivonen. Tasa: Telecommunication alarm sequence analyzer or how to enjoy faults in your network. In Network Operations and Management Symposium, 1996., IEEE, volume 2, pages vol.2, apr
[6] M. D. Holtz, B. M. David, and R. T. de Sousa Junior. Building scalable distributed intrusion detection systems based on the mapreduce framework. Revista Telecomunicações, 2:22-31,
[7] L. Li and M. Zhang. The strategy of mining association rule based on cloud computing. In Business Computing and Global Informatization (BCGIN), 2011 International Conference on, pages , July
[8] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh. Apriori-based frequent itemset mining algorithms on mapreduce. In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ICUIMC '12, pages 76:1-76:8, New York, NY, USA, ACM.
[9] H. Mannila, H. Toivonen, and I. A. Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3): , Jan
[10] H. R. Motahari-Nezhad, R. Saint-Paul, F. Casati, and B. Benatallah. Event correlation for process discovery from web service interaction logs. The VLDB Journal, 20(3): , June
[11] H. Qiu, R. Gu, C. Yuan, and Y. Huang. Yafim: A parallel frequent itemset mining algorithm with spark. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages IEEE Computer Society,
[12] H. Reguieg, F. Toumani, H. Motahari-Nezhad, and B. Benatallah. Using mapreduce to scale events correlation discovery for business processes mining. In A. Barros, A. Gal, and E. Kindler, editors, Business Process Management, volume 7481 of Lecture Notes in Computer Science, pages Springer Berlin Heidelberg,
[13] H. Reguieg, F. Toumani, H. Motahari-Nezhad, and B. Benatallah. Using mapreduce to scale events correlation discovery for business processes mining. Technical Report HPL , Hewlett Packard Laboratories, Aug
[14] R. Ren, X. Fu, J. Zhan, and W. Zhou. Logmaster: Mining event correlations in logs of large scale cluster systems. arxiv preprint arxiv: ,
[15] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages , New York, NY, USA, ACM.
[16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1-10, May
[17] R. Vilalta and S. Ma. Predicting rare events in temporal domains. In Data Mining, ICDM Proceedings IEEE International Conference on, pages ,
[18] G. M. Weiss. Industry: predicting telecommunication equipment failures from sequences of network alarms. In W. Klösgen and J. M. Zytkow, editors, Handbook of data mining and knowledge discovery, pages Oxford University Press, Inc., New York, NY, USA,
[19] G. M. Weiss and H. Hirsh. Learning to predict rare events in event sequences. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages AAAI Press,
[20] G. Wu, H. Zhang, M. Qiu, Z. Ming, J. Li, and X. Qin. A decentralized approach for mining event correlations in distributed system monitoring. Journal of Parallel and Distributed Computing, 73(3): , Models and Algorithms for High-Performance Distributed Data Mining.
[21] Y. Wu, S. Du, and W. Luo. Mining alarm database of telecommunication network for alarm association rules. In Dependable Computing, Proceedings. 11th Pacific Rim International Symposium on, page 6 pp., dec
[22] O. Yahya, O. Hegazy, and E. Ezat. An efficient implementation of apriori algorithm based on hadoop-mapreduce model. In Proc. of the,
[23] X. Y. Yang, Z. Liu, and Y. Fu. Mapreduce as a programming model for association rules algorithm on hadoop. In Information Sciences and Interaction Sciences (ICIS), rd International Conference on, pages , June 2010.