Correlation discovery from network monitoring data in a big data cluster

Kim Ervasti
University of Helsinki
kim.ervasti@gmail.com

ABSTRACT

Monitoring a telecommunications network is a tedious task. Diverse monitoring data is propagated through the network to different kinds of monitoring software. Processing rules are needed to make sense of this data. Because entering the rules manually is not an option at the scale of today's telecommunication networks, automated solutions based on data mining are needed. Existing solutions for mining network monitoring data suffer from the performance problems of computationally demanding algorithms, so a scalable distributed solution is needed. This paper surveys existing studies on distributed solutions for mining and processing network monitoring data. We also summarize and evaluate the experimental results of these studies and propose directions for future research.

Categories and Subject Descriptors
D.4.7 [Organization and Design]: Distributed Systems

Keywords
Big data, network monitoring, event correlation, decentralized data mining, frequent item set, frequent patterns, association rules, mapreduce, apache hadoop

1. INTRODUCTION

Telephone companies and telecommunications operators have a critical need to monitor their systems: networks and services. Poor quality or downtime of a service causes loss of money to both the operator and its customers. A telephone company offers a broad catalogue of services. For example, it may operate mobile phone networks, broadband networks, core networks, hosting services, web services and nowadays even value-added services such as Internet movie rentals and home security services. These services are provided on top of a mixture of technologies and networks. The network equipment rooms and server rooms, the latter often referred to as colocation centres, must also be monitored so that the equipment can be hosted safely.
Network monitoring consists of collecting monitoring data, measurements and event messages from the network equipment. To understand the causes and causalities behind the received data, the data must be processed. Processing the network monitoring data is referred to as correlating. To correlate the monitoring data, one must determine what data is important, which alarms are related to each other, and which alarm indicates the real cause of a network issue. These correlation rules are either defined by network experts or generated automatically by data mining the monitoring data. Telecommunication networks are continuously growing in both size and complexity. Today there is an urgent need for data mining solutions that automate the discovery of effective correlation rules. Many data mining solutions exist in this field, but most of them do not scale to the rapidly growing amount of data they should be able to process. This paper surveys the existing studies on scaling up these data mining solutions for network monitoring data to big data scale.

The paper is organized as follows. Section 2 presents a quick overview of the architecture behind network monitoring. Section 3 demonstrates how monitoring data is used to detect faults in the network and why this is non-trivial. Section 4 presents how this problem can be solved and improved by using data mining. Section 5 presents the potential issues in scaling up the data mining of network monitoring data. Section 6 surveys the existing studies, evaluates their results and makes further proposals. Section 7 concludes the paper.

2. NETWORK MONITORING ARCHITECTURE

Telecommunications operators must monitor their services and networks to assure the functionality of the network and the availability of the services provided.
All of these networks, consisting of different technologies, equipment rooms, server rooms, servers and the services on top of them, are monitored by a wide spectrum of complex monitoring software. Each monitoring software solution is specialised to monitor a specific part of the network: the monitoring protocols and methods vary depending on what is being monitored. The variety is so great that a common monitoring software solution for monitoring everything is practically impossible to build. Equipment manufacturers tend to make closed-source monitoring software specialised for their own equipment. In addition to monitoring, these software products also act as the network element managers required to operate the equipment. Equipment licenses, if nothing else, may prevent using other software for management.

Figure 1 illustrates a simplified hierarchy of network monitoring. At the bottom is the monitored equipment, referred to as network nodes. On the next level are the element managers, or equipment-specific monitoring software, which monitor and manage these specific technologies. They are used by network operators, who are experts on the specific technology in question. These monitoring systems in turn forward the monitoring data to a collective network monitoring system, which processes the data on a higher level by combining it and discovering correlations between events from different sources. Processing the data reveals important network events, and three options commonly exist for making use of the discovered information: 1) customer notifications can be generated to inform customers of the situation, 2) automations can be executed to try to fix the issue automatically, for example by resetting the equipment, and 3) a service ticket requesting repairs can be issued to the subcontractor responsible for the equipment. These options are not mutually exclusive.

[Figure 1: Network monitoring architecture. Labels: Management, Events, Notifications, Automations, Ticketing, Network monitoring system, Network devices, Radio access network, Servers, Equipment rooms.]

There are different ways to monitor the network. Network equipment, equipment rooms, servers and services are most often monitored in three ways:
1. by collecting continuous metric measurement data, also known as performance data,
2.
by collecting somewhat continuous end-to-end (E2E) testing data, and
3. by receiving discrete event messages, referred to as events, from the network nodes.

Monitoring software is usually specialised in only one of these ways of monitoring the network.

Performance data is gathered by collecting numeric data such as equipment sensor readings, router transfer rates, the number of point-to-point connections (telephone calls), and the number of Short Message Service (SMS) messages (text messages) sent per hour. Performance data can be transformed into discrete event data. This is called discretization. Discretization is done by defining threshold values that cause an event to be generated when the measured value crosses them. For example, if the temperature of a CPU rises above 70 °C, an alarm with minor severity is created. If the temperature rises further above 80 °C, another event with major severity is created. If the temperature returns to normal, say below 50 °C, an acknowledging event with normal severity is created.

End-to-end data consists of the results of various end-to-end tests or measurements, for example: 1) how fast a server or service responds end to end, 2) how fast 1024 kilobytes can be transferred through a 3G mobile network from device to device, and 3) whether a specific service responds at all. The results of these tests are also often transformed into events, for example when the test fails (the tested service is not responding) or the measurement crosses a threshold (the transfer rate is too slow).
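The threshold-based discretization of performance data described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thresholds are the ones from the CPU temperature example, while the function name `discretize`, the event format, and the choice to keep the current severity while the value stays between the thresholds are hypothetical and not taken from any particular monitoring product.

```python
from typing import Iterable, List, Tuple

# Thresholds from the CPU temperature example in the text (degrees Celsius).
MINOR_ABOVE = 70.0
MAJOR_ABOVE = 80.0
NORMAL_BELOW = 50.0


def discretize(readings: Iterable[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Turn (timestamp, temperature) readings into (timestamp, severity) events.

    An event is emitted only when the severity changes. Assumption: the
    severity is escalated when a threshold is exceeded and cleared only
    when the value drops below NORMAL_BELOW; in between, it is kept.
    """
    events: List[Tuple[str, str]] = []
    state = "normal"
    for timestamp, value in readings:
        if value > MAJOR_ABOVE:
            new_state = "major"
        elif value > MINOR_ABOVE and state == "normal":
            new_state = "minor"
        elif value < NORMAL_BELOW:
            new_state = "normal"
        else:
            new_state = state  # no threshold crossed: keep the current severity
        if new_state != state:
            events.append((timestamp, new_state))
            state = new_state
    return events
```

Keeping the severity unchanged between 50 °C and 70 °C avoids generating a new event for every reading that hovers around a single threshold, at the cost of delaying the acknowledging event.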

Event data consists of event messages that equipment hardware, or a monitoring agent on a server, sends to the monitoring system about an event detected on the device. Examples of network events are a device startup, a network interface going down (link down), or a measurement value rising above a predefined threshold.

3. NETWORK FAULT ISOLATION

Network monitoring is about collecting mainly two kinds of equipment state information from the network: 1. state changes as events, and 2. metric information as measurements, as described earlier. Network monitoring systems do not magically receive information about every problem in the network [4], because broken equipment might not be able to send alarm events to the monitoring systems, or degraded equipment may be unable to detect the problem by itself. By gathering measurement and status information, a network monitoring system can collect enough known symptoms to detect the fault. In other words, when there is a problem in the network or in a service, the monitoring data usually contains symptoms of the fault.

A simple example is a broken hard drive in a server. The server sends an alarm to the monitoring system saying that the hard drive is unreadable. This event is a very clear symptom of the fault: a broken hard drive. However, it is not always this trivial. Consider a broken network router. The router broke without warning, so it did not manage to send any alarm event before the fault occurred, and now it is unable to handle any network traffic. The monitoring data does, however, contain alarm events from other routers in the network saying that their links to this specific router are all degraded or down. From these symptoms we can conclude that the router has in fact failed. A fault can have different kinds of symptoms depending on the case.
Very often multiple different symptoms are observed as a result of a single fault in the network. Event messages are produced based on those symptoms. It is not rare for a single fault to produce a storm of events. The underlying reason for the fault, the root cause, is not always obvious even when a fault in the network is detected. In other words, the fault is the problem produced by a certain root cause. In the router example, the fault is lost or degraded network connectivity around the broken router, and the broken router is the root cause of this fault. Symptoms of the fault are slow or zero transfer speeds or high packet loss on the degraded network links.

Another more complex but very concrete correlation example is illustrated in figure 2. The figure shows a connectivity fault in the Multiprotocol Label Switching (MPLS) core network between two Dense Wavelength Division Multiplexing (DWDM) transponders. As a result, connectivity to the Provider Edge (PE) router is lost, a Digital Subscriber Line Access Multiplexer (DSLAM) using that router has lost its connectivity, and all the equipment that depends on that DSLAM for connectivity is now out of service: a small 3G base station site and several corporate networks. All three of these types of equipment, the DSLAM, the 3G site and the corporate networks, are actively monitored by an element manager or monitoring software, so it is quickly discovered that the connection to those services is lost. Alarms are generated by each of these systems individually. The monitoring of the MPLS core network also raises an alarm about the lost link to one of the DWDM transponders. Although a total of four alarms are generated, only one of them points directly to the device needing repairs.
It is, however, non-trivial to discover that these four alarms are related to each other when they are received in the central network monitoring system shown in figure 1.

Finding the root cause of a fault requires root cause analysis based on the available monitoring data about the symptoms. The monitoring data must be combined and correlated. By correlating the network monitoring data we can find which events and measurement data are related to a certain problem and which are not. By combining this problem-related data, we can further track down the root cause of the problem. For example, if we observe that we have lost the connection to exactly all the base stations under a certain Base Station Controller (BSC), we can conclude that the problem is not in any individual base station but in either the BSC or the transmission link to the BSC. Transmission links carry the phone calls, text messages and mobile data between the BSC and the operator, as well as the management connection to the BSC. If we do not observe any alarm event from the specific transmission link that provides connectivity to this BSC, we can conclude that the BSC itself is broken.

To discover the root cause this way, the correlation between the alarms must be discovered first, that is, which received alarms or measurements are associated with each other and which are not. For example, if we receive two alarms from two different base stations, it is not automatically known whether those alarms are really correlated. With data mining approaches, correlation information can be discovered as event correlation patterns [20].

4. DATA MINING OF THE ALARM DATA

To identify the root cause of a network fault, the alarms related to the fault must be discovered. This can be done if the network topology is known: a faulted device obviously affects the devices topologically related to it.
Unfortunately, in a big telephone company the network topology is rarely known in a format that could be used in root cause analysis. This is especially true if the alarms related to the fault are emitted from several different types of networks. Multiple topologies of different types of networks would be needed, and the connections between them would also need to be known. Services provided on top of a physical network, or often a combination of different types of networks, are rarely mapped to the underlying network as a combined topology. As services are often mapped to logical hosts and connections instead of physical ones, linking them all together into a single combined topology is a complex task. If the topology is not available, the effective relations among network equipment must be discovered by data mining. Because telecommunications networks are huge, this approach requires a large amount of learning data collected over a long time period. This makes the data very large and the data mining computationally very time-consuming.
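For contrast, when a usable topology does exist, the topology-based reasoning mentioned above is straightforward. The following Python sketch is a hypothetical illustration, not a method from any of the surveyed studies: it assumes the topology is available as a simple "depends on" mapping (here modelled loosely on the figure 2 scenario, with made-up element names) and reports the alarming elements that are not explained by an alarming element upstream of them.

```python
from typing import Dict, Optional, Set

# Hypothetical dependency map loosely based on figure 2:
# each element points to the element it depends on for connectivity.
FIGURE_2_UPSTREAM: Dict[str, Optional[str]] = {
    "dwdm_link": None,
    "pe_router": "dwdm_link",
    "dslam": "pe_router",
    "3g_site": "dslam",
    "corporate_networks": "dslam",
}


def root_cause_candidates(
    upstream: Dict[str, Optional[str]], alarming: Set[str]
) -> Set[str]:
    """Return alarming elements that have no alarming element upstream.

    Assumes each element has at most one upstream dependency; the `seen`
    set only guards against accidental cycles in the mapping.
    """
    candidates: Set[str] = set()
    for element in alarming:
        seen = {element}
        parent = upstream.get(element)
        explained = False
        while parent is not None and parent not in seen:
            if parent in alarming:
                explained = True  # an upstream alarm accounts for this one
                break
            seen.add(parent)
            parent = upstream.get(parent)
        if not explained:
            candidates.add(element)
    return candidates
```

With the four alarms of the figure 2 scenario (`dwdm_link`, `dslam`, `3g_site`, `corporate_networks`), only `dwdm_link` remains as a candidate. The difficulty discussed in this section is that the `upstream` mapping is usually not available across technologies, which is why the relations have to be mined from the data instead.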

[Figure 2: Fault impact in the network. Labels: MPLS Core, DWDM, Fault, Provider Edge router, DSLAM, 3G site, Corporate Networks, Alarm.]

Discovered patterns of network events and measurements can also be used for pre-emptive monitoring of the network [18]. Proactive network management in the form of failure prediction is a hot topic in today's telephone companies. Low-severity alarms, warnings and information about routine state changes are continuously received from network devices. They rarely indicate a real problem occurring in the near future. Which low-severity alarms indicate an upcoming problem, and with what probability, is not known a priori. Unfortunately, even the equipment manufacturers themselves usually do not know this. That is why the trend has been to discard these alarms altogether. Reacting to every one of them is not possible in a network of telephone company scale. Whether a low-severity alarm or a group of alarms predicts a real problem is a statistic that can be discovered by mining the monitoring data, if enough learning data is available. Ren et al. [14] put it a bit differently: if we can find a statistical rule that identifies a correlation between warning events and a fatal event, the occurrence of those warning events indicates an elevated probability of the fatal event occurring in the near future.

Figure 3 illustrates in graphical form two different but related predictive patterns discovered from event data. In the example, three events about symptoms 1, 2 and 3, received in this order, are known to precede a network fault with high probability. Two events about symptoms 1 and 2, in this order, are also known to precede the same fault, but with lower probability, as shown by edge D in the figure. Symptom 1 alone does not seem to indicate any fault. In general, patterns of low-severity events leading to severe fault situations do exist in network monitoring data.
When combined with network measurement data, even more useful patterns exist. As the patterns exist, they can also be discovered with the right tools. Discovering these hidden patterns in the data is the basis of fault prediction. Because of the massive amount of data, data mining approaches rather than human work must be applied to find them.

In terms of data mining, alarm correlations are regarded as alarm association rules [20]. Association rules are discovered by searching for frequent item sets in the data. The frequent item sets found can then be transformed into association rules. Frequent item sets are sets of one or more items that occur frequently in the data. The occurrence frequency of an item set is called the support value of the item set. The support value is defined as the number of transactions that contain the item set divided by the number of all transactions. The threshold for an item set being frequent is a user-defined value called the minimum support value.

[Figure 3: Pre-emptive pattern. Labels: Symptom 1, Symptom 2, Symptom 3, Problem; edges A, B, C and D; Time axis.]

Frequent item sets are often mined from transactional data. An easy and typical real-life example of transactional data is market purchase data: collections of items bought together by customers. In this example, item sets represent sets of merchandise that are often bought together. The association rules express implications between the items of a frequent item set. For example, if bread and butter are frequently bought together, and can therefore be discovered as a frequent item set, an association rule saying that customers who buy bread often also buy butter could be discovered. Discovering association rules is very useful for finding predictive patterns in network monitoring data. Figure 3 could represent the following association rule: if symptoms 1, 2 and 3 are detected in the network, the Problem will occur. A confidence value is attached to each association rule.
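As a small worked illustration of these two measures, the sketch below computes the support of an item set exactly as defined above, and the confidence of a rule using the standard definition confidence(X -> Y) = support(X and Y together) / support(X), which is assumed here rather than quoted from the paper. The shopping baskets are invented purely for the example.

```python
from typing import FrozenSet, Iterable, List

# Invented example baskets (transactions); not data from any study.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "cheese"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
    frozenset({"milk", "cheese"}),
    frozenset({"butter"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for transaction in transactions if wanted <= transaction)
    return hits / len(transactions)


def confidence(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    transactions: List[FrozenSet[str]],
) -> float:
    """Estimated probability of `consequent` given `antecedent`."""
    antecedent = frozenset(antecedent)
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(antecedent | frozenset(consequent), transactions) / base
```

In these baskets, bread and butter appear together in 4 of 8 transactions (support 0.5), and bread appears in 5, so the rule "bread implies butter" has a confidence of 4/5.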
The confidence value represents the probability that the rule holds. The confidence of a rule is rarely 100%. For example, some people prefer to eat their bread without butter, so they buy bread without it. The rule could therefore have a confidence of 80%. In figure 3, the rule represented by edge D, saying that if symptoms 1 and 2 occur the Problem will occur, has a lower confidence than the rule represented by edge C, which requires all three symptoms to be detected first. Association rules are generated from the discovered frequent item sets.

In addition to frequent item set discovery, rare events such as catastrophic failures, which obviously do not occur frequently, can also be of interest. In that case, rare item sets are searched for instead [18, 19, 17, 15]. However, the rare target event must be known and specified. Data mining is then applied to discover the predictive pattern leading to it.

After performing either rare or frequent item set mining and transforming the item sets into association rules, a confidence can be calculated for each rule. High-confidence rules can then be selected and matched against the incoming alarm data stream in real time. When such a pattern is detected, a new high-severity alarm, with additional information about the root cause, can be generated before the fault has actually occurred. Depending on whether the fault can be prevented or not, either pre-emptive actions or preparations for repairs can be made.

5. CHALLENGES IN DATA MINING THE NETWORK MONITORING DATA

The Apriori algorithm is very commonly proposed and used for mining frequent item sets and discovering association rules from transactional data [1]. Apriori works as follows. First, it finds all single items in the data that are frequent, in other words, whose support value is at least the predefined minimum support value. Second, it generates candidate pairs from all the frequent items discovered in the first step. Third, it counts the occurrences of these candidate pairs in the data and prunes the candidates whose support is lower than the minimum support. After this step, the frequent item sets of two items have been discovered. Fourth, candidate item sets of three items are generated from the frequent pairs and items, and then counted and pruned in the same way as the candidate pairs. This procedure continues, increasing the size of the item sets, until no item set of the current size reaches the minimum support value. All the item sets found to reach the minimum support value during the run of the algorithm are frequent item sets.
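The level-wise procedure described above can be sketched compactly in Python. This is a minimal, unoptimized illustration and not the implementation used in any of the surveyed studies; the function name `apriori`, the use of relative support, and treating an item set as frequent when its support is greater than or equal to the minimum are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori(
    transactions: Iterable[Iterable[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return every frequent item set with its support (level-wise search)."""
    data: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    total = len(data)

    def keep_frequent(candidates: Set[FrozenSet[str]]) -> Dict[FrozenSet[str], float]:
        # One scan over the data per level: count each candidate, then prune.
        counts = {candidate: 0 for candidate in candidates}
        for transaction in data:
            for candidate in candidates:
                if candidate <= transaction:
                    counts[candidate] += 1
        return {c: n / total for c, n in counts.items() if n / total >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([item]) for t in data for item in t})
    frequent = dict(current)
    size = 2
    while current:
        previous = set(current)
        # Join step: combine frequent sets of the previous level.
        candidates = {a | b for a in previous for b in previous if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        size += 1
    return frequent
```

The repeated full scan in `keep_frequent` and the growing number of candidates per level are exactly the costs that make the classic algorithm hard to apply to large volumes of monitoring data.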
The length of the item sets varies, but interest can of course be restricted to sets of a specific length.

The classic Apriori algorithm suffers from an exponentially growing search space when mining frequent item sets [20], and it needs to scan through the data multiple times. Telecommunications networks are often huge and ever-growing, not just in size but also in complexity and in the variety of technologies used. The amount of measurement data in telecommunications network monitoring is on the terabyte scale daily. Event data is smaller than measurement data, hundreds of megabytes or gigabytes per day, but as the discovery of frequent sets is computationally expensive [20], even this amount of data is challenging for centralized data mining. Measurement and event data combined easily reach big data scale. Therefore, to find the patterns in reasonable time, the data mining must be scalable [22]. The Apriori algorithm is also incapable of discovering the temporal relationships among alarms [20]. Temporality and the order of alarms have a central role in correlations [20, 14]. Thus the Apriori algorithm must be modified to work in parallel on a distributed platform and to take temporality into account.

6. SURVEY OF PROPOSED SOLUTIONS

Distributing Apriori with MapReduce [2, 6] has been widely studied, and several implementations have been proposed. MapReduce is a programming model and implementation for processing massive data sets in parallel in a distributed fashion. Users define only the map and reduce functions, and the underlying framework automatically handles the distribution of the computation across the cluster, hardware failures and load balancing. To name just a few important groups of contributors to MapReduce-distributed Apriori, see [22, 8, 7, 23]. However, very few authors have addressed the problem in the context of network event correlation, so most of these algorithms lack the ability to take temporality into account.
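To make the division of labour between the map and reduce functions concrete, the following single-process Python sketch imitates the programming model for the first Apriori step, counting the support of single items. It is not Hadoop code and uses no Hadoop API: the function names and the in-memory grouping that stands in for the framework's shuffle phase are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_transaction(transaction: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (item, 1) once for every distinct item in a transaction."""
    for item in sorted(set(transaction)):
        yield item, 1


def reduce_counts(
    key: str, values: List[int], min_count: int
) -> Iterator[Tuple[str, int]]:
    """Reduce step: sum the counts of one key and keep it only if frequent."""
    total = sum(values)
    if total >= min_count:
        yield key, total


def run_job(transactions: Iterable[Iterable[str]], min_count: int) -> Dict[str, int]:
    """Imitate a MapReduce job in one process (map, group by key, reduce)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for transaction in transactions:  # a real framework runs these in parallel
        for key, value in map_transaction(transaction):
            grouped[key].append(value)
    output: Dict[str, int] = {}
    for key, values in grouped.items():  # one reduce call per distinct key
        for item, total in reduce_counts(key, values, min_count):
            output[item] = total
    return output
```

Because each map call looks at a single transaction and each reduce call at a single key, a framework is free to spread both across many nodes; only the grouping between them requires data to move over the network.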
Other big data scale data mining solutions exist for discovering patterns in, for example, market purchase data, medical data and financial data, but these solutions also lack the temporal, ordering and data dimensionality capabilities that network monitoring data requires [14]. As described above, network events have an important temporal relationship, which has to be taken into account. This rules out making direct use of the considerable number of studies on big data solutions for mining market purchase data, medical data and financial data.

6.1 Study 1: Distributed Apriori

Wu et al. [20] proposed the following interesting features in their study:
1. the MapReduce-Apriori algorithm, a MapReduce-based implementation of Apriori that discovers frequent item sets in parallel,
2. translation of network events into transaction form, taking the temporality of events into account,
3. a decentralized database solution that makes parallel data mining possible, and
4. an approach for filtering irrelevant events locally to reduce the total amount of alarm data and further improve performance.

The problem with mining network monitoring data in parallel lies in distributing the alarm data. As the correlations among alarms are not known a priori, there is no way of knowing which alarms are needed at which mining node either. For this, Wu et al. proposed a file system shared between the nodes, the Global Distributed File System (GDFS). Each node uses its local storage as a cache and can upload files to and download files from the GDFS.

Wu et al. [20] processed all monitoring data into events. Discretizing the measurement data relies on predefined threshold values, as described earlier.
Because the thresholds are defined by an expert, there is a risk of losing interesting anomalies that could carry relevant information for patterns when combined with other event data, for example if the thresholds are set too high or too low because the behaviour of the equipment has been misunderstood. On the other hand, discretizing the measurement data makes the data much easier to mine and also radically reduces its size.

Wu et al. [20] transformed the collected events into transactions. Their monitoring environment processes events only in periods of five minutes. This simplifies dividing the

alarms into transactions, but in real-world use it also introduces unwanted bias into the temporal information of the events. An example of alarms divided into transactions is shown in table 1. In this simplified example only four different nodes, numbered from 1 to 4, have sent events, and the events have only one attribute: an alarm identification number. Alarms are divided into five-minute periods. Transactions are then handled as item sets. The goal is to find correlations among these items in the form of association rules. An Apriori-based iterative process is proposed to tackle this task.

Table 1: Example of events transformed into transactions

TID | Time slot | Transaction
1   | 00:00     | Node 1: event 1, Node 2: event 2, Node 2: event 3, Node 3: event [...]
2   | 00:05     | Node 1: event 1, Node 2: event 2, Node 2: event [...]
3   | 00:10     | Node 1: event 1, Node 2: event 3, Node 3: event 2, Node 4: event [...]
4   | 00:15     | Node 2: event 3, Node 3: event 2

Wu et al. divided the association rule mining process into two phases:
1. find frequent item sets whose support count is greater than or equal to a predefined minimum support value, and
2. generate association rules whose confidence is greater than or equal to a predefined minimum confidence value.

Both phases are implemented using MapReduce to parallelize the computation. In frequent item set mining, the candidate sets are generated in the map function, and the reduce function counts the frequency of the candidate given to it and outputs the candidate set only if it reaches the support threshold. To generate the association rules, all subsets of every frequent item set must be enumerated. To parallelize rule discovery, subsets are generated in the map function, and the reduce function computes the confidence of the rule given to it and outputs the rule only if the confidence reaches the confidence threshold. The algorithms are presented in detail in [20].

Wu et al.
[20] implemented their algorithm on top of the Apache Hadoop framework [3], a widely used open-source implementation of the MapReduce architecture. They built a synthetic distributed network that mimicked a network environment and generated events. The simulated network elements were related to each other, so the events were correlated as well. To demonstrate the effectiveness of their algorithm, they applied both the distributed MapReduce-Apriori and the classical centralized Apriori to the data and compared the execution times. The distributed implementation ran on two computers similar to the one used for centralized mining. They tested the algorithms with data of [...], [...], [...], and [...] events. The execution times increased exponentially as the amount of data increased to [...] events. However, although one might expect overhead from the distributed solution, execution times were almost halved on every data set with MapReduce-Apriori compared to classical Apriori. The implementation Wu et al. proposed shows potential. However, test results with more than two worker nodes would be very interesting to see.

6.2 Study 2: Overhead of MapReduce

Another group of authors, Reguieg et al. [12] (a more extensive report is also available [13]), studied event correlation in a somewhat different context than network alarm data. They used the extensive work of Motahari-Nezhad et al. [10] as the basis of their study. Motahari-Nezhad et al. analysed web service interaction log files to identify correlations among events that belong to the same process execution instance. This kind of pattern discovery from an event log is nevertheless somewhat similar to discovering event patterns from system and service log files in a telecommunication service context. Motahari-Nezhad et al. structured the log files into messages, each containing multiple attributes.
Correlated messages are identified by using correlation conditions. In their model, two types of conditions exist: 1) atomic conditions and 2) conjunctive conditions. An atomic condition is met when an attribute of one message equals an attribute of another message. A conjunctive condition is a conjunction of several atomic conditions. They did not use the Apriori algorithm but developed their own algorithm for finding the conditions, generating candidates and pruning them.

Reguieg et al. [12] introduced a MapReduce-based approach for discovering these correlation conditions in parallel. They partitioned the logs across the cluster nodes to balance the workload. The implementation had two stages: 1) a map and reduce step that computes the atomic correlation conditions and their associated process instances, and 2) a level-wise map and reduce step over these results that computes the conjunctive conditions per process instance.

Reguieg et al. [12] tested their implementation with both real-world datasets and randomly generated data of different log sizes. The experiment results can be seen in [13]. Their focus was on observing the overhead of using the MapReduce framework. They ran the tests on Apache Hadoop with five virtual nodes, one master and four workers. Their results show that the overhead of MapReduce is small compared to the overall gain in performance and scalability. They made two main observations from the results: 1) extending the proposed algorithm with additional map-reduce steps is potentially interesting, as it would increase the level of parallelization and hence improve scalability without too negative an impact on overall performance, and 2) the implementation of stage two, discovering conjunctive conditions, can be improved by using compact data structures and Apriori-like algorithms. They pointed out the need for future research on different data partitioning techniques and other kinds of implementations that

may increase overhead while further improving the scalability of the proposed approach.

6.3 Study 3: Architecture of distributed event correlation discovery

A third group of authors, Holtz et al. [6], studied a similar problem in the context of intrusion detection. In network fault management, event logs are gathered from nodes in the network, comprising network devices, servers, service logs and network element management software logs, to discover faults. Similarly, in intrusion detection, network traffic, operating system logs and general application data are collected from various sensors around the network, comprising network devices, servers and user workstations, to identify intrusions. The data collected from the different sources is aggregated, processed and compared for event correlation analysis [6], much as in network fault management. In intrusion detection, event correlations are discovered in order to find patterns for attack signatures and malicious activities.

Holtz et al. proposed as their main contribution an architecture for a Distributed Intrusion Detection System (DIDS). The system has the following features:
1. Distributed data collection from multiple sources
2. Scalable storage in a distributed filesystem infrastructure
3. Scalable distributed event correlation discovery in a Big Data cluster
4. Implementation with open source software tools

The architecture was designed to use the MapReduce framework. Specifically, they used the Hadoop implementation of MapReduce for data processing and the Hadoop Distributed File System (HDFS) [16, 6] for scalable storage of the network data. The proposed architecture is illustrated in figure 4. The HIDS and NIDS sensors in the figure are Host Intrusion Detection Systems (HIDS) and Network Intrusion Detection Systems (NIDS), the former being a sensor for an individual host and the latter a sensor listening to the network in promiscuous mode. The DIDS architecture proposed by Holtz
et al. also provides Web visualization for the results, as seen in the figure.

Figure 4: Distributed Intrusion Detection System architecture [6]

Holtz et al. [6] introduced no algorithms in their paper, just the architecture, and then tested the data storage and processing performance of the cluster. The cluster was composed of one master node and five worker nodes. The first experiment tested filesystem performance and consisted of creating random files of varying sizes. File sizes ranged from 1 megabyte to 250 megabytes. The experiment results show that write times increase linearly with increasing file size. Holtz et al. claimed it to be clear that the system would scale to a large quantity of data. The second experiment tested computational performance by sorting the random data used in the first experiment. It was noted that the entropy of random files is high, so sorting the content was claimed to be the worst-case scenario. The results show that the sorting time remained constant regardless of the file size, oscillating between 33.4 and 35.6 seconds for all file sizes. The oscillation occurred due to natural variation in network performance. The reason for the constant sort time is the constant overhead of initializing the Hadoop cluster and preparing the worker nodes for the task. The time elapsed in the sorting task itself is insignificant compared to this synchronization overhead, which is claimed to indicate that the solution scales nicely to different data volumes.

The introduced Hadoop-based architecture shows great potential in both storing and processing data in amounts that scale to even massive magnitude if needed. Though the architecture was designed for Intrusion Detection purposes in the paper, the same qualities are needed in any correlation discovery environment, which makes it a significant achievement in the field.

6.4 Summary and evaluation of the solutions

Three environments were introduced in the studies presented.
All of them used MapReduce to distribute the load, and the studies found the approach feasible. The greatest difference between the studies was the file system. The first study shared the whole file system among all nodes, so all data was accessible to every node. The second study partitioned the data among the nodes. The third used HDFS to distribute the data among the nodes. A partitioned or distributed model is more scalable than a centralized one, but the right solution depends on the needs of the system: what kinds of rules are to be discovered from the data.

Reguieg et al. [12] introduced a simple algorithm for data processing, and partitioned the data among the nodes for efficient parallel computation. But as they noted in their paper, partitioning the data was their biggest difficulty. Partitioning the data is problematic especially if the task is to discover complex correlations among all different types of technologies, in other words, among all the alarms instead of a certain group of alarms. The easiest, but most brute-force, solution to this problem is to have all alarm data accessible to all nodes. The alarms can then be arbitrarily divided among the nodes and assigned to them accordingly. The nodes then use the database of all the alarms to search for alarms that correlate with the assigned alarms. Because the correlations are not known a priori, all alarms must be accessible as candidates for correlating with the assigned alarm. This database can either be centralized and then cached among the nodes, as Wu et al. proposed, or stored in a distributed file system that is accessible to all nodes, as Holtz et al. [6] proposed. The centralized solution, however, might be a bottleneck when scaling the system, and also fails to provide the same level of fault tolerance as a large-scale distributed file system like HDFS. On the other hand, HDFS introduces a small overhead and additional network traffic among the nodes. If the collected monitoring data fits into the local cache of each node, it is probably best to use the cached solution instead.

To discover simple local correlations, however, partitioned data is often a better solution, as it is most likely faster. The data can be partitioned per node, as Reguieg et al. [12] proposed, or it can be partitioned into bigger slices that are temporally close or local to the nodes most likely to need that data, in other words, using the temporal locality principle in partitioning the data. For example, when discovering frequent patterns from monitoring data consisting of events collected over a period of two months, the data can be partitioned temporally, because it is very unlikely that the alarms at the beginning and the end of the data correlate with each other. For accurate predictive patterns, the correlations must be discovered among events within only a few hours or at most a few days of each other [18]. So, in this case, the data could be partitioned into 4-day segments, for example, giving a total of 15 partitions. In addition to temporal partitioning, monitoring data can be partitioned geographically into sectors: alarms coming from sources 100 kilometers away from each other are most likely not correlated.
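The temporal partitioning example above (two months of data cut into 4-day segments, giving 15 partitions) can be sketched as follows. This is a minimal illustration and not code from any of the surveyed studies: the function names `partition_by_time` and `segment_count`, the `(timestamp, alarm_id)` event layout and the sample data are all assumptions made for the sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def segment_count(total_days, segment_days=4):
    """Number of segments needed to cover total_days (ceiling division)."""
    return -(-total_days // segment_days)


def partition_by_time(events, start, segment_days=4):
    """Group (timestamp, alarm_id) events into fixed-length temporal segments.

    The event layout and the segment length are hypothetical; the survey
    only suggests 4-day segments as one possible choice.
    """
    segment = timedelta(days=segment_days)
    partitions = defaultdict(list)
    for timestamp, alarm_id in events:
        # Dividing two timedeltas with // yields the integer segment index.
        index = (timestamp - start) // segment
        partitions[index].append((timestamp, alarm_id))
    return dict(partitions)


# Hypothetical sample: one alarm per day over a 60-day period.
start = datetime(2015, 1, 1)
events = [(start + timedelta(days=d), f"alarm-{d}") for d in range(60)]
partitions = partition_by_time(events, start)
print(len(partitions))  # 15 segments of 4 days each
```

Each segment could then be handed to a different worker node. In this sketch, events on opposite sides of a segment boundary are never compared, so correlations spanning a boundary would be missed; letting neighbouring segments overlap is one possible remedy.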
The sectors should of course overlap, to avoid losing correlations at the edges of the sectors. The problem with geographical partitioning is that equipment does not carry information about its geographical location [4], so this information must be provided by an additional equipment database.

The first study, from Wu et al. [20], used the Apriori algorithm, and the second study, from Reguieg et al., also recommended improving their approach with it. Apriori is a common algorithm for frequent item set mining, and it has already been widely proven to work for mining network monitoring data in many centralized solutions, [1, 9, 5] to name a few, with the only downside of being computationally intensive. This downside is to some extent mitigated by the presented distributed solutions. Another dilemma was the mandatory step of transforming the network monitoring data to transactional form while taking temporality into account. Wu et al. proposed a simple yet working solution, introduced in this paper, and more accurate solutions that decrease the temporal bias of the implementation are available, for example [21].

It is clear that more work is needed in this field. First, studies with centralized data mining solutions in this field claim that the Apriori algorithm is far from being the only algorithm for mining network monitoring data. Studies introducing other parallel data mining algorithms, not based on Apriori, that fulfill the requirements of mining network monitoring data are needed. They have the potential to show better scalability and lower computational requirements. Second, all the existing studies on distributing the data mining of network monitoring data, and the like, are based on the MapReduce framework. While other frameworks are comprehensively studied in other contexts, distributing the data mining of network monitoring data introduces problems somewhat unique to this context.
Further studies are needed based on other frameworks, like Spark. However, there is a very recent study [11] on distributing the Apriori algorithm with the Spark framework. As it is clear that network monitoring data can be transformed to transactional data while taking temporality into account, and then mined with Apriori with minor modifications to the algorithm, this proposed Spark solution is most likely convertible to this context as well. Third, there exist no recent public studies that use real-life, big-data-scale telephone company data for experiments. In many countries the laws and regulatory requirements strictly prevent telecommunications customer data from being 1) used in a way that is not related to the tasks required for managing the network and services, and 2) handled by persons who are not required for operating the networks and services. In other words, the network data is very confidential in nature. Telephone companies and big corporations do study solutions, but these commercial solutions are classified as company secrets because of the strategic advantage they give in the market, and are therefore not open to public research.

7. CONCLUSION

This paper surveyed and then evaluated existing solutions for mining interesting patterns from telecommunications network monitoring data in a distributed fashion. The paper can be summarized in five steps. First, it was explained that the ability to mine the network monitoring data is important for telecommunications companies in order to assure the functionality of the provided services. Second, it was further discussed that the main problem of the data mining is the high computational requirements, caused by two factors: 1) the amount of monitoring data produced on a daily basis is huge, and 2) monitoring data is needed for a lengthy period of time to provide enough learning data to cover a wide network.
Third, existing studies on distributing the data mining to scalable solutions were surveyed and three important aspects were discovered: 1) a modified Apriori algorithm that supports temporal data can be distributed with MapReduce, 2) the overhead is relatively small, and 3) the data can be either centralized and cached to worker nodes, partitioned among worker nodes, or distributed between the worker nodes, depending on the case. Fourth, the studies were evaluated, and all of them were found to show that the proposed distributed solutions work. And finally, fifth, the need for future work was addressed: studying algorithms other than Apriori and frameworks other than MapReduce in the context of mining network monitoring data.

8. REFERENCES

[1] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pages ,
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1): , Jan
[3] A. S. Foundation. Apache Hadoop project
[4] R. Gardner and D. Harle. Fault resolution and alarm correlation in high-speed networks using database mining techniques. In Information, Communications and Signal Processing, ICICS., Proceedings of 1997 International Conference on, volume 3, pages vol.3, sep
[5] K. Hatonen, M. Klemettinen, H. Mannila, P. Ronkainen, and H. Toivonen. TASA: Telecommunication alarm sequence analyzer or how to enjoy faults in your network. In Network Operations and Management Symposium, 1996., IEEE, volume 2, pages vol.2, apr
[6] M. D. Holtz, B. M. David, and R. T. de Sousa Junior. Building scalable distributed intrusion detection systems based on the MapReduce framework. Revista Telecomunicações, 2:22-31,
[7] L. Li and M. Zhang. The strategy of mining association rule based on cloud computing. In Business Computing and Global Informatization (BCGIN), 2011 International Conference on, pages , July
[8] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh. Apriori-based frequent itemset mining algorithms on MapReduce. In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ICUIMC '12, pages 76:1-76:8, New York, NY, USA, ACM.
[9] H. Mannila, H. Toivonen, and I. A. Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3): , Jan
[10] H. R. Motahari-Nezhad, R. Saint-Paul, F. Casati, and B. Benatallah. Event correlation for process discovery from web service interaction logs. The VLDB Journal, 20(3): , June
[11] H. Qiu, R. Gu, C. Yuan, and Y. Huang. YAFIM: A parallel frequent itemset mining algorithm with Spark. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages IEEE Computer Society,
[12] H. Reguieg, F. Toumani, H. Motahari-Nezhad, and B. Benatallah. Using MapReduce to scale events correlation discovery for business processes mining. In A. Barros, A. Gal, and E. Kindler, editors, Business Process Management, volume 7481 of Lecture Notes in Computer Science, pages Springer Berlin Heidelberg,
[13] H. Reguieg, F. Toumani, H. Motahari-Nezhad, and B. Benatallah. Using MapReduce to scale events correlation discovery for business processes mining. Technical Report HPL , Hewlett Packard Laboratories, Aug
[14] R. Ren, X. Fu, J. Zhan, and W. Zhou. LogMaster: Mining event correlations in logs of large scale cluster systems. arXiv preprint arXiv: ,
[15] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages , New York, NY, USA, ACM.
[16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1-10, May
[17] R. Vilalta and S. Ma. Predicting rare events in temporal domains. In Data Mining, ICDM Proceedings IEEE International Conference on, pages ,
[18] G. M. Weiss. Industry: predicting telecommunication equipment failures from sequences of network alarms. In W. Klösgen and J. M. Zytkow, editors, Handbook of data mining and knowledge discovery, pages Oxford University Press, Inc., New York, NY, USA,
[19] G. M. Weiss and H. Hirsh. Learning to predict rare events in event sequences. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages AAAI Press,
[20] G. Wu, H. Zhang, M. Qiu, Z. Ming, J. Li, and X. Qin. A decentralized approach for mining event correlations in distributed system monitoring. Journal of Parallel and Distributed Computing, 73(3): , Models and Algorithms for High-Performance Distributed Data Mining.
[21] Y. Wu, S. Du, and W. Luo. Mining alarm database of telecommunication network for alarm association rules. In Dependable Computing, Proceedings. 11th Pacific Rim International Symposium on, page 6 pp., dec
[22] O. Yahya, O. Hegazy, and E. Ezat. An efficient implementation of Apriori algorithm based on Hadoop-MapReduce model. In Proc. of the,
[23] X. Y. Yang, Z. Liu, and Y. Fu. MapReduce as a programming model for association rules algorithm on Hadoop. In Information Sciences and Interaction Sciences (ICIS), rd International Conference on, pages , June 2010.

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Secret Sharing based on XOR for Efficient Data Recovery in Cloud

Secret Sharing based on XOR for Efficient Data Recovery in Cloud Secret Sharing based on XOR for Efficient Data Recovery in Cloud Computing Environment Su-Hyun Kim, Im-Yeong Lee, First Author Division of Computer Software Engineering, Soonchunhyang University, kimsh@sch.ac.kr

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Dual Mechanism to Detect DDOS Attack Priyanka Dembla, Chander Diwaker 2 1 Research Scholar, 2 Assistant Professor

Dual Mechanism to Detect DDOS Attack Priyanka Dembla, Chander Diwaker 2 1 Research Scholar, 2 Assistant Professor International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Engineering, Business and Enterprise

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

A Review of Anomaly Detection Techniques in Network Intrusion Detection System A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining

More information

The WAMS Power Data Processing based on Hadoop

The WAMS Power Data Processing based on Hadoop Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore The WAMS Power Data Processing based on Hadoop Zhaoyang Qu 1, Shilin

More information

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Selection of Optimal Discount of Retail Assortments with Data Mining Approach Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.

More information

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Big Data Mining Services and Knowledge Discovery Applications on Clouds Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades

More information

Policy-based Pre-Processing in Hadoop

Policy-based Pre-Processing in Hadoop Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden yi.cheng@ericsson.com, christian.schaefer@ericsson.com Abstract While big data analytics provides

More information

Binary Coded Web Access Pattern Tree in Education Domain

Binary Coded Web Access Pattern Tree in Education Domain Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: kc.gomathi@gmail.com M. Moorthi

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

Hadoop Technology for Flow Analysis of the Internet Traffic

Hadoop Technology for Flow Analysis of the Internet Traffic Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

AUTOMATED AND ADAPTIVE DOWNLOAD SERVICE USING P2P APPROACH IN CLOUD

AUTOMATED AND ADAPTIVE DOWNLOAD SERVICE USING P2P APPROACH IN CLOUD IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 63-68 Impact Journals AUTOMATED AND ADAPTIVE DOWNLOAD

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

How To Balance In Cloud Computing

How To Balance In Cloud Computing A Review on Load Balancing Algorithms in Cloud Hareesh M J Dept. of CSE, RSET, Kochi hareeshmjoseph@ gmail.com John P Martin Dept. of CSE, RSET, Kochi johnpm12@gmail.com Yedhu Sastri Dept. of IT, RSET,

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Service Monitoring and Alarm Correlations

Service Monitoring and Alarm Correlations Service Monitoring and Alarm Correlations Oliver Jukić Virovitica College Virovitica, Republic of Croatia oliver.jukic@vsmti.hr Ivan Heđi Virovitica College Virovitica, Republic of Croatia ivan.hedi@vsmti.hr

More information

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK 1 K.RANJITH SINGH 1 Dept. of Computer Science, Periyar University, TamilNadu, India 2 T.HEMA 2 Dept. of Computer Science, Periyar University,

More information

Mining Interesting Medical Knowledge from Big Data

Mining Interesting Medical Knowledge from Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive

More information

UPS battery remote monitoring system in cloud computing

UPS battery remote monitoring system in cloud computing , pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology

More information

Process Intelligence: An Exciting New Frontier for Business Intelligence

Process Intelligence: An Exciting New Frontier for Business Intelligence February/2014 Process Intelligence: An Exciting New Frontier for Business Intelligence Claudia Imhoff, Ph.D. Sponsored by Altosoft, A Kofax Company Table of Contents Introduction... 1 Use Cases... 2 Business

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce

Elastic Application Platform for Market Data Real-Time Analytics. for E-Commerce Elastic Application Platform for Market Data Real-Time Analytics Can you deliver real-time pricing, on high-speed market data, for real-time critical for E-Commerce decisions? Market Data Analytics applications

More information

DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE

DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE SK MD OBAIDULLAH Department of Computer Science & Engineering, Aliah University, Saltlake, Sector-V, Kol-900091, West Bengal, India sk.obaidullah@gmail.com

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information
