Tolerating Late Timing Faults in Real Time Stream Processing Systems. Stuart Perks


School of Computing
FACULTY OF ENGINEERING

Tolerating Late Timing Faults in Real Time Stream Processing Systems

Stuart Perks

Submitted in accordance with the requirements for the degree of Advanced Computer Science MSc
Session 2014/2015

The candidate confirms that the following have been submitted:

Items | Format | Recipient(s) and Date
Project Report | Report | SSO (03/09/15)
Implementation | Software code and URL | Supervisor, assessor (05/09/15)
Implementation Documentation | User manuals | Supervisor, assessor (05/09/15)
End of Project Presentation | PowerPoint | Supervisor, assessor (05/09/15)
Log Book | Digital Document Form | Supervisor, assessor (05/09/15)

Type of Project: Exploratory Software

The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student)

© 2015 The University of Leeds and Stuart Perks

Summary

The number of data sources for big data applications is increasing, and interest is growing as companies start to realise the financial benefits of big data. Many big data applications must be able to perform in real time, when the data is most valuable. These real time systems commonly work on unbounded real time streams of data that require highly parallelised, large compute clusters to meet the systems' requirements. These real time stream processing systems must be reliable, fast, scalable and fault tolerant. Current fault tolerant techniques in real time stream processing systems do not deal efficiently with transient late timing faults caused by slow processing tasks. This project has proposed a new fault tolerant solution that uses data prediction to overcome slow processing tasks in real time stream processing systems. The proposed solution has been built as a Java library and applied as an extension to Apache Storm. The feasibility of the proposed solution has been investigated by performing experiments, where it was tested on a Storm cluster with a cloud data centre monitoring case study and compared to current fault tolerant solutions in real time stream processing systems. The data throughput, timings, accuracy of prediction and resource overheads of the proposed solution were evaluated by the experiments. The results of the feasibility study reveal that the new fault tolerant solution using data prediction is a success: the solution effectively minimises and hides the impacts of slow processing tasks in a real time stream processing system.

Acknowledgements

Stuart Perks would like to thank the following people for their help, feedback and guidance throughout the project: Dr Peter Garraghan (School of Computing, University of Leeds), Professor Jie Xu (School of Computing, University of Leeds) and Dr Roy Ruddle (School of Computing, University of Leeds).

Table of Figures

Figure 1 Example Stream Processing System Sequential
Figure 2 Example Stream Processing System Parallel
Figure 3 Weather Data Accuracy Results
Figure 4 Google Trace Log Data Accuracy Results
Figure 5 System Model
Figure 6 Example Fault Model
Figure 7 Fault Tolerant Agent Details
Figure 8 Cobertura Code Coverage Results
Figure 9 Small Data Small Faults Data Throughput Single Node
Figure 10 Small Data Small Faults Single Node Average Time
Figure 11 Small Data Small Faults Single Node Deviation
Figure 12 Small Data Small Faults End to End Average Times
Figure 13 Small Data Small Faults End to End Data Throughput
Figure 14 Large Data Small Faults Single Node Throughput
Figure 15 Large Data Small Faults End to End Throughput
Figure 16 Large Data Small Faults End to End Average Time
Figure 17 Small Data Large Faults End to End Throughput
Figure 18 Small Data Large Faults End to End Average Times
Figure 19 Large Data Large Faults End to End Throughput
Figure 20 Large Data Large Faults End to End Average Times
Figure 21 Single Worker Virtual Machine Average Resource Usage
Figure 22 Single Worker Virtual Machine Average Network Usage
Figure 23 Four Worker Virtual Machines Average Resource Usage
Figure 24 Four Worker Virtual Machines Average Network Usage
Figure 25 Sample Worker Log Data for Prediction Results

Index of Tables

Table 1 Related Work Stream Processing Fault Tolerance
Table 2 Google Trace Log KNN Size Results
Table 3 Experiment Metrics
Table 4 Single Worker Virtual Machine Small Faults Prediction Metrics
Table 5 Four Worker Virtual Machines Small Faults Prediction Metrics
Table 6 Single Worker Virtual Machine Large Faults Prediction Metrics
Table 7 Four Worker Virtual Machines Large Faults Prediction Metrics

Table of Contents

Summary
Acknowledgements
Table of Figures
Index of Tables
Table of Contents
1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Project Aim
  1.4 Evaluation of Solution
  1.5 Assumptions
  1.6 Contributions
  1.7 Project Methodology
  1.8 Software Project Management
  1.9 Conclusion
2 Background
  2.1 Big Data
  2.2 Cloud Computing
  2.3 Relationship between Big Data and Cloud Computing
  2.4 Real Time Systems
  2.5 Stream Processing Systems
  2.6 Batch Processing Systems
  2.7 Dependability
  2.8 Real Time Stream Processing Systems
    Stream Processing System Model
    Real Time Stream Processing Case Studies
    Real Time Stream Processing Systems Requirements
    Stream Processing Technologies
    Selected Technology: Apache Storm
  2.9 Data Patterns
  2.10 Data Prediction
    Speculative Execution
    Data Prediction Techniques
    When to use Data Prediction
    Incorrect Data Prediction
3 Research Problem: Transient Late Timing Faults
  3.1 Real Time Stream Processing Issues
    Transient Late Timing Faults
    Big Data Scale
    Slow Processing Tasks
    Fault Tolerance Overheads
  3.2 Related Work: Fault Tolerance in Real Time Stream Processing
    Replication and Upstream Backup
    Data Prediction
  3.3 Opportunities
4 Requirements
  4.1 Finalised Requirements
  4.2 Assumptions
  4.3 Fault Scenarios Tolerated
5 Initial Investigation
  5.1 Initial Experiments
  5.2 Data
  5.3 Performance Metrics
  Results
    Accuracy Weather Data
    Accuracy Google Trace Log Data
    Timings and Training Data Size
  Initial Experiments Conclusion
6 Design
  6.1 Initial Research
  System Model
  Late Timing Fault Activity
  Fault Tolerance Agent
  Technical Details
    Timing Fault Monitor
    Fault Tolerance Controller
    Prediction
    Evaluation
  Key Design Decisions
    Decentralised Design
    Hot Standby Design
  Conclusion
7 Implementation
  Methodology
  Case Study: Cloud Monitoring
    Storm Topology
    Storm Cluster
  Java Library Design
  Technical Details
    Monitoring Component
    Prediction Component
    Evaluation Component
    Slow Processing Fault Detected Scenario
  Technical Challenges
    Multiple Threads
  Programming Standards & Code Quality
    General Design and Implementation
    Project Management Tool: Maven
    Dependency Injection: Spring Framework
    Software Testing
    Code Quality Tools
    Experimentation Code
8 Experimentation Plan
  Summary
  Metrics
  Experiments Plan
  Experiment Assumptions
  Fault Injection Simulator
9 Experiments Results
  Experiment Results Summary
  Timing and Throughput Results
    Small Faults: Single Worker Virtual Machine, Four Worker Virtual Machines
    Large Faults: Single Worker Virtual Machine, Four Worker Virtual Machines
  Resource Usage Results
    Combined Small and Large Faults
  Prediction Accuracy Results
    Small Faults
    Large Faults
10 Evaluation
  Solution Evaluation
  Future Work
  Personal Reflection
11 Conclusion
References
Appendix A
  External Materials
  Project Timeline Version
  Example Google Cluster Trace Log Data
  Example Weather Data
  Initial Investigation CPU and Memory Results
  Initial Investigation Weather Results
  Cluster Experimentation Further Results
    Small Faults Single Worker VM: Small Data, Medium Data, Large Data
    Large Faults Single Worker VM: Small Data, Medium Data, Large Data
    Small Faults Four Worker VMs: Small Data, Medium Data, Large Data
    Large Faults Four Worker VMs: Small Data, Medium Data, Large Data
Appendix B Ethical Issues Addressed
  Data Sources
  Shared Resources
  Software

1 Introduction

1.1 Background

Big data as a term is used to describe enormous data sets that traditional database structures are incapable of managing effectively. The major characteristics of big data can be summarised as volume, velocity, variety and value [1]. To put the volume of big data into context, Facebook on average has to store three billion new photos each month [2]. Big data contains valuable knowledge, therefore applications have been developed in an attempt to extract this value. These big data applications commonly interact with real time data streams, where data is most valuable on arrival and its value needs to be extracted as fast as possible [3]. Real time systems place as much importance on meeting their timing requirements as on the correctness of results [4]. Stream processing systems (also known as complex event processing systems [5] or continuous query processing systems [6]) are systems developed to process single or multiple sources of data in motion [7], [8]. These stream processing systems commonly perform operations such as filtering or aggregation. Cloud data centre monitoring is an example application that works with big data and is required to work in real time [9]. The four requirements of real time stream processing systems are low latency, high availability, scalability and fault tolerance [10]. Combining big data and real time stream processing results in systems that need to be scalable to meet their demands of volume and velocity. As these systems grow, the chances of faults entering the system increase because of increased complexity [7], although these hybrid systems provide opportunities for new applications that can lead to financial gains. Apache Storm is an example technology that enables these hybrid systems [11].

1.2 Problem Statement

The key problem this project has focused on is tolerating transient late timing faults efficiently at big data scale in real time stream processing systems, where the transient late timing faults have been caused by man-made non-malicious faults such as software bugs leading to slow processing tasks. When big data and real time systems are combined, the chances of transient late timing faults increase because of the increased complexity of the system. Late timing faults lead to poor performance of the stream processing system and have a cascading impact on the entire system because of data dependences. The current techniques for fault tolerance in stream processing are replication and upstream backup. Both of these techniques do not deal efficiently with slow processing tasks.

Replication does not scale resources efficiently when overcoming latency or slow processing tasks, and it also requires complex coordination protocols that may slow down all replicas. Upstream backup is slow to recover, and it will treat slow processing tasks as failures if detected. Upstream backup is the technique used by Storm. A successful technique to hide late timing faults in cloud based gaming is Outatime, which uses data prediction of player commands and visual frames to mask latency in the system [12]. However, Outatime is not flexible, as it only allows a single data prediction technique to be used, and it has only been applied to the gaming domain.

1.3 Project Aim

The aim of the project has been to study the effectiveness of a newly proposed fault tolerant technique that uses data prediction to overcome the defined problem of late timing faults caused by slow processing tasks in real time stream processing systems. The solution has been designed so that it can distribute timing monitors throughout a distributed stream processing system, monitoring for late timing faults at each processing node. If a late timing fault alert is raised, the fault tolerant solution will select from a repository of data prediction techniques and send the predicted data rather than wait for the slow processing node to complete. When selecting a data prediction technique, the system will consider the historical accuracy and timing requirements of each prediction algorithm for the required data stream and select the best. This proposed technique aims to take the success of Outatime [12] and apply it to real time stream processing, where in certain domains predicted data is acceptable; for example, Twitter uses approximate answers to meet strict timing requirements [11].

1.4 Evaluation of Solution

The fault tolerant solution has been developed and evaluated using a baseline stream processing architecture built in Storm. This stream processing architecture calculates the average CPU and memory usage of servers using the Google trace log data set [13]. The proposed solution using data prediction was compared against a baseline architecture running no fault tolerance and an architecture running with upstream backup. Transient late timing faults were injected into each solution, and the four metrics of throughput, timing, prediction accuracy and resource overheads were used to measure the feasibility of the proposed solution.

1.5 Assumptions

A large variety of data prediction techniques exist that are applicable to many different domains. For this project a selection of simple algorithms was used to enable the investigation into tolerating late timing faults.

Writing an algorithm that achieves highly accurate predictions for cloud data centre monitoring is out of the scope of this project. It is assumed that the user of the newly proposed fault tolerant solution can create their own prediction algorithm or select the best fitting one for their unique scenario. It has been considered that there is no generic algorithm that fits all domains for data prediction, and therefore the ability to choose from a selection is a key part of the proposed solution.

1.6 Contributions

This project has contributed a new fault tolerant technique that overcomes late timing faults caused by slow processing tasks by using data prediction. A second contribution is the feasibility study of the newly proposed fault tolerant technique.

1.7 Project Methodology

A systematic approach has been taken to complete this project: looking into the problem first, performing a literature review, and then designing, implementing and experimenting with a developed solution. To begin the project, a list of key terms was defined that related to the project topics. A comprehensive literature review was completed on these key terms to understand the current research challenges and solutions in the area. Once the problem of late timing faults had been defined, a solution was modelled to overcome the issues of late timing faults in real time stream processing. Agile development was used to develop the modelled software artefact. Once the solution had been developed, comprehensive experiments were conducted to provide empirical evidence of the effectiveness of the proposed solution. To record all the research and technical decisions, a log book was kept where notes and references were made that were later referred to in the project.

1.8 Software Project Management

For the software aspect of the project an agile approach was followed, which was particularly important for an exploratory software project because a large part of the project was spent refining the problem. A key agile principle followed was responding to change over following a plan, which was important with the changing requirements at the beginning of the project. When developing the software an incremental approach was used, where small increments were added to the software artefact as soon as their requirements had been set and they had been added to the design. Potential blockers in the development were identified early, for example the setting up of the Storm cluster, which could be completed before the software had been fully designed.

A timeline was created with the key parts of the project broken down into manageable parts (see Appendix A, page 63). The timeline was updated each week with the dynamic nature of the project, but key milestones were identified and met. The timeline was broken into weekly sprints where the allocated tasks were completed in that sprint. Multiple iterations of requirements, design, implementation and testing were completed when new knowledge and requirements had been added to the project. A large focus was on maintaining the quality of the code, therefore unit tests were written that covered each new code feature, and these acted as regression tests for future iterations. Using correct source code management was also important and began at the start of the project. A Git [14] repository was set up on Bitbucket [15] where code was committed. Each new component of code or minor change was committed to allow ease of reverting back if mistakes were made.

1.9 Conclusion

The aim of this project was to develop and evaluate a newly proposed technique that used data prediction for tolerating late timing faults caused by slow processing tasks in real time stream processing systems. An example cloud data centre Storm system was developed and enhanced with the proposed fault tolerant solution, which uses data prediction to overcome late timing faults. The focus of this project was on tolerating late timing faults caused by slow processing tasks. A feasibility study has been completed where the proposed solution was compared against a baseline solution with no fault tolerance and current techniques for fault tolerance in real time stream processing.

2 Background

2.1 Big Data

There are a large number of domains where data is being collected at big data scale, including the Internet of Things (IoT), social media, retail, medicine, science and finance [1]. The big data analytics workflow takes data, ingests it by applying filters and putting it into an understandable form, performs analysis using data mining techniques, and enables the extraction of valuable knowledge. This valuable knowledge leads to improvements of current services and creates opportunities for new ones, leading to financial gains. The fourth paradigm of science is said to be data intensive scientific discovery [16]. Running these big data applications requires large computing clusters that can scale vertically and horizontally [17].

2.2 Cloud Computing

Peter Mell and Timothy Grance at the National Institute of Standards and Technology define cloud computing as: "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [18] p.2. Cloud computing offers three core service models: software as a service (SaaS), platform as a service (PaaS) and infrastructure as a service (IaaS) [18]. Cloud consumers use these services as their demand requires, usually on a pay per use model. Four cloud deployment models exist: a private cloud that is owned exclusively by a private organisation, a public cloud where resources are available for public use, a community cloud where the cloud resources are exclusive to a community of users, and a hybrid cloud that is made of two or more distinct public, private or community clouds. Cloud computing enables end users to have an apparent supply of infinite resources that can grow and shrink as required.

2.3 Relationship between Big Data and Cloud Computing

A key characteristic of cloud computing is rapid elasticity [18], which enables the appearance of infinite cloud resources, providing the capabilities of storage and processing for big data.

The need for big data applications and hardware has accelerated the creation of cloud computing resources and also enhanced the analytical capabilities for big data, resulting in cloud and big data supplementing each other [1]. Cloud computing and big data work closely together but provide different services. Cloud computing clients are usually technology companies that use the cloud for applications such as big data to provide solutions to business clients. Big data is a type of application that runs on cloud resources and is used to provide insight to these business clients.

2.4 Real Time Systems

A real time system is one where the physical timing of the result is just as important as the correctness of the result [4]. Real time systems can be classified into hard real time or soft real time. Hard real time systems must meet response times in the order of milliseconds and must be able to maintain operations autonomously. Hard real time systems are usually mission critical, for example a safety system in an aeroplane, where failure causes serious consequences. A soft real time system's response times are usually required in seconds, and if the deadline is missed no serious consequences are caused. Using redundancy in soft real time systems is acceptable, where checkpoints using rollback and recovery techniques can be used to overcome faults [4]. In hard real time systems checkpoints are of limited use because of the time it takes to roll back and recover, or because the fault is too severe to recover from. As real time systems are required to have very fast response times, most error detection and recovery must be dealt with by the system, as human intervention is too slow; this is applicable to both hard and soft real time systems, but soft systems may involve some human intervention [4]. Data analytics is now required to be completed in real time. Analytical systems have moved from online transaction processing, which captures data, to online analytic processing, where analysis gave meaning to the data and recommendations for future actions, and now to real-time analytic processing, where the previously proven analytical techniques are applied to data in motion [19]. As described, the Internet of Things is a major source of real time data, but real time data can be used in many domains such as the social web, financial services or marketing.

2.5 Stream Processing Systems

Stream processing systems (also known as complex event processing systems [5] or continuous query processing systems [6]) are systems developed to process single or multiple sources of data in motion [7], [8]. The usual architecture of these systems is a directed acyclic graph [20]. They are commonly monitoring systems where aggregation or filtering is performed. Each single point of data that enters the stream is processed individually, allowing for immediate results.

The scale and speed of this data mean that humans cannot realistically process and analyse it, therefore systems are required that can analyse and process this data, extracting meaning and even taking actions by themselves.

2.6 Batch Processing Systems

Batch processing is when a fixed size data set is processed completely and the result usually cannot be viewed until the whole data set has been processed [21]. Apache Hadoop is an example of a batch query processor [22]. A combination of stream processing and batch processing has formed micro batch processing [21]. These micro batch systems work with an unbounded stream of data but group the data into small batches that are processed together. This reduces complexity but can increase latency. Examples of micro batch processing systems are Apache Spark Streaming [23] and Trident [21].

2.7 Dependability

Dependability is concerned with providing a trusted software service and includes the following attributes: availability, reliability, safety, integrity and maintainability [24]. To achieve dependability, four categories have been formed: fault prevention, to stop the introduction of faults; fault tolerance, to enable meeting the system specification in the presence of faults; fault removal; and fault forecasting, to predict the present number of potential faults and their potential impacts. Fault tolerance is the continued behaviour of a system that meets its specification even in the presence of faults [25]. Fault tolerance works by using error detection, which looks to discover errors in the system, and recovery, which removes errors and returns the system to a valid state by using error handling and fault handling techniques. Error handling removes errors from the system state by using rollback, roll forward or compensation techniques. Fault handling prevents the same faults from returning by diagnosis, isolation, reconfiguration and re-initialisation [24]. Faults can either be transient, where they have a limited duration, or permanent, where a failed component cannot recover for a long period of time or indefinitely [25]. Failures include timing failures, where the specified time of arrival is either early or late, or the duration of delivered information is incorrect [24]. In streaming systems, three types of recovery guarantee have been stated: precise recovery, where any effects from failures are completely hidden except for some increase in latency; rollback recovery, where no information is lost but duplications may result; and gap recovery, where only recent data is recovered and older data may be dropped to reduce overheads [26].

2.8 Real Time Stream Processing Systems

Stream Processing System Model
Traditional stream processing systems follow a pattern of having data flowing through the system. They usually work in a sequential pattern where a processing node does a small task and then outputs to another node, which completes its own task, until all processing is completed by the system (see Figure 1). Each processing node can also be parallelised, and this is commonly required to improve performance (see Figure 2).

Figure 1 Example Stream Processing System Sequential
Figure 2 Example Stream Processing System Parallel
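To make the sequential model concrete, the following is a minimal, illustrative Java sketch of a chain of processing nodes; the function names and data are hypothetical and not part of the thesis implementation.

```java
import java.util.function.Function;

// Hypothetical sketch: each processing node transforms a data point and
// hands it to the next node, forming the sequential pipeline of Figure 1.
public class SequentialPipeline {

    public static void main(String[] args) {
        // Node 1 parses the raw reading, node 2 filters/clamps it, node 3 formats it.
        Function<String, Double> parse = Double::parseDouble;
        Function<Double, Double> clamp = v -> Math.min(v, 100.0);
        Function<Double, String> format = v -> "cpu=" + v + "%";

        // Composing the nodes yields the end-to-end stream computation.
        Function<String, String> pipeline = parse.andThen(clamp).andThen(format);

        for (String raw : new String[] {"12.5", "250.0", "63.1"}) {
            System.out.println(pipeline.apply(raw)); // each data point processed individually
        }
    }
}
```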

Real Time Stream Processing Case Studies

IoT
The Internet of Things refers to a collection of uniquely identifiable connected objects such as RFID tags, weather sensors and mobile phones [1], [27]. With the Internet of Things growing, data is being collected constantly and from a growing number of sources such as temperature sensors. This data is being collected in real time, therefore real time systems need to be developed to gain the knowledge benefits of the data [8]. Mobile Millennium is an application that uses crowd sourced GPS data and performs machine learning to estimate traffic conditions in cities in real time [28]. Real time systems also commonly provide alerts on sensors, such as motion detection sensors in houses reporting events [19].

Social Web
Social web mining, in particular of Twitter, needs to be done in real time as tweets are written and displayed instantly by a user [29]. In 2013 Twitter reported that it collected approximately 500 million tweets per day [8]. Real time stream processing is used by Twitter to calculate the current trending topics or update the number of followers, along with other tasks [21]. Real time analysis of this Twitter data can provide marketing opportunities, reveal breaking news stories to news corporations or provide knowledge of customer relations.

Data Centre Monitoring
Cloud data centre monitoring is an example real time stream processing application [9]. It enables the controlling and management of hardware and provides insight into the quality of service a data centre is providing, ensuring it meets its service level agreements. This can include scheduling tasks, migrating virtual machines and detecting faults in hardware.

Fraud
Real time fraud detection can limit the impact of fraud and maintain companies' reputations for security; for example, eBay runs fraud detection analysis on real time trading activity [8]. This analysis will prevent customer accounts from being used to bid on lots of products, or detect false selling accounts, improving the quality of the eBay experience. Financial institutions perform real time fraud detection to protect their customers' money, increasing the institutions' reputation [10]. Fraud detection is improved by real time analysis as it can limit the damage at that moment, rather than wait and learn from past experiences.

Financial Services
Financial institutions that perform electronic trading will use real time analysis because of the volume of market data. Volumes of electronic trading are growing exponentially, resulting in traditional processing systems being unable to manage this growth in volume [10].

These systems require low latency, as incorrect information or delays in processing can result in large financial losses.

Advertising
Advertising companies can release campaigns on social media, and real time systems enable the real time tracking of the success of these campaigns [30]. This allows real time adaptation of the content of the campaign to interact with customers further and more successfully, generating higher revenues from marketing. Websites also use real time systems to monitor website click streams to enable them to target marketing to users [31], leading to improved marketing and profits.

Security
Systems that are exposed to networks use real time attack detection, applying real time analytics to detect and prevent attacks such as denial of service [32].

Gaming
Real time stream processing is being used in cloud based gameplay, where game processing is completed in a cloud environment and the results are streamed in real time to the user [12]. The games are played in real time, and if the real time system cannot meet the real time requirement, game players will no longer use this type of gameplay.

Real Time Stream Processing Systems Requirements
The four requirements are low latency, high availability, scalability and fault tolerance. Real time systems work on critical, long running dataflows where scalability, high availability [20], [33], [34] and fault tolerance are essential [31]. To meet the high availability demand, the system needs to be running efficiently at all times [10]. This continuous running is required as the data is constantly being generated for the system in real time [8]. Latency needs to be minimised, with the stream highly optimised for high volume, low latency processing [20], [10], [35]. Along with the low latency requirement, data always needs to be kept moving in the stream, meaning that data may not be blocked or delayed in the stream at all, otherwise it will lead to bottlenecks in the flow [10]. The nature of distributed systems will result in imperfections to the data in the streams. A real time stream processing system should be able to adapt to these imperfections while still meeting the near real time requirement. The system should guarantee the production of predictable and repeatable results [10]. The recovery from these failures should be resource efficient [35], and the system must be able to recover without affecting other results. The efficiency of these systems can be measured by the runtime overheads with no failures and the recovery overhead in the presence of failures [36].

Stream Processing Technologies
Apache Storm is a general framework that enables the near real time processing of data over a parallel distributed system. Storm is reliable as it contains a guaranteed message processing framework to deal with failures in the system [11], [37]. Apache Samza is a distributed stream processing framework that has a simple message processing API. It uses YARN to manage the distributed cluster that processes its jobs [38]. Samza stores state by snapshotting tasks and can use YARN to migrate failed tasks to other machines in the cluster. Apache S4 is a distributed framework that allows the continuous processing of streams of data [39]. Data events are routed through processing nodes called Processing Elements, which either output to one or more other Processing Elements or output results. Apache S4 overcomes failed processing nodes by redirecting messages to a standby node to be processed. These failed nodes are detected using a timeout and a heartbeat by the cluster supervisor node. Amazon Kinesis is a service offered by Amazon Web Services (AWS) that can process real time data from many different resources, with the ability to scale to handle thousands of data sources [40]. This is a commercial product that must be used with AWS and paid for. Apache Spark Streaming is a component of Apache Spark that enables micro batch processing of data streams [23]. Spark Streaming uses checkpointing to overcome failures in the stream architecture.

Selected Technology: Apache Storm
Storm is one of the most reputable and widely used frameworks for distributed stream processing in near real time [37], [11]. Storm defines topologies, which represent the graph of computation in the system containing the data streams and the processing nodes. Storm has two main components: firstly spouts, which are defined data streams into the stream processing system, and secondly bolts, which perform the computation on the stream. A bolt may receive multiple streams of data and may itself emit multiple streams of data. The data passed around by Storm is referred to as a tuple, which represents a single data point emitted by a spout.
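To illustrate these concepts, here is a minimal sketch of a topology using the Storm 0.9.x era (backtype.storm) Java API; the spout, bolt and component names are illustrative and are not the topology built for this project.

```java
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class MonitoringTopology {

    // Spout: the data stream into the system, emitting (serverId, cpuUsage) tuples.
    public static class CpuSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100); // avoid busy spinning in this toy example
            collector.emit(new Values("server-" + random.nextInt(4), random.nextDouble() * 100));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("serverId", "cpuUsage"));
        }
    }

    // Bolt: the computation on the stream; here it simply prints each tuple.
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getString(0) + " cpu=" + tuple.getDouble(1));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("cpu-spout", new CpuSpout(), 1);
        // Parallelism hint of 4: the bolt is parallelised as in Figure 2.
        builder.setBolt("print-bolt", new PrintBolt(), 4)
               .fieldsGrouping("cpu-spout", new Fields("serverId"));

        new LocalCluster().submitTopology("monitoring", new Config(), builder.createTopology());
    }
}
```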

A Storm cluster is formed of a single master node and multiple worker nodes. The master node performs job tracking, allocating work to worker nodes and monitoring for faults. The worker nodes listen for tasks assigned by the master and process them as appropriate. This approach allows Storm to scale over a large cluster.

Storm in Industry
A number of companies are using Storm in their systems for a variety of tasks; these include Twitter, Spotify, Yahoo and Alibaba [41]. For example, Twitter uses Storm to perform simple aggregation, filtering or counting, along with performing more complex tasks such as clustering. Twitter commonly uses approximations in its data to increase performance, and this data is then later corrected using MapReduce [11].

Storm Fault Tolerance
Storm uses upstream backup to enable guaranteed processing of messages [37]. The Storm spout tracks each data tuple that it submits to the stream processing system. If this tuple, or any of the tuples it generates further downstream, fails or is lost, it can be replayed by the spout. Each processing bolt will either acknowledge or fail a data tuple. If all data tuples and their children are processed successfully, the acknowledgement will return to the spout that emitted the original tuple. If any of the data tuples fail or time out, the spout that emitted the original tuple will have its fail method called. The default timeout is 30 seconds. An issue with this approach is that data may be sent twice when a tuple has merely been processed slowly rather than failed. In a Storm cluster, nodes are designed to be stateless and any information is stored on disk. If a node fails it is designed to restart quickly. If the node cannot restart, the master node allocates its task to another node [21].
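The ack and fail hooks described above are part of Storm's standard spout API; the following minimal sketch (illustrative names, not the thesis code) shows how a spout can buffer pending tuples and replay them when fail is called, which is the upstream backup behaviour Storm relies on.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Illustrative spout: every emitted tuple is kept until acked, replayed when failed.
public class ReplayingSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<Object, Values> pending = new ConcurrentHashMap<>();

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        Values tuple = new Values(readNextReading()); // data source, stubbed below
        Object msgId = UUID.randomUUID();
        pending.put(msgId, tuple);       // keep a copy until the tuple tree is acked
        collector.emit(tuple, msgId);    // anchoring with msgId enables ack/fail callbacks
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);           // fully processed: safe to discard the copy
    }

    @Override
    public void fail(Object msgId) {
        collector.emit(pending.get(msgId), msgId); // failed or timed out: replay
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("reading"));
    }

    private double readNextReading() {
        return Math.random(); // placeholder for a real data source
    }
}
```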

2.9 Data Patterns

Data characteristics are formed from observations or measurements [42]. Data can be discrete, where it is classified into a category (a limited range of values), or continuous, where the data can take any value. Data characteristics have referential components, which relate to the context in which the data pattern characteristics are formed. Referential components have three important types: time, space and the grouping of data items. Spatial data patterns are when there are patterns in the distance between data values. Temporal data patterns are when there are patterns in the data based on time.

2.10 Data Prediction

Data value prediction is a technique commonly used to overcome delays in a data dependent system [43]. Data dependences occur when the execution of data is dependent on another set of data that must be processed first before the dependent process can be completed.

Speculative Execution
Speculative execution uses data prediction but predicts the data before it knows whether it is required. This is a general latency hiding technique where work is processed before it is known if it is required, or while data is being waited for from a slow operation [12]. Speculative execution hides delays and can decrease the worst case running time of a system [44].

Data Prediction Techniques
Wang, Kai et al. [43] present three approaches to overcome the issue of data dependences in instruction level parallelism and then look at combining these to produce hybrid predictors. The first method, last outcome prediction, is a simple method for data value prediction that takes the last known instance of the value being predicted and reuses it. Studies showed that the accuracy of this method was only about 49%, resulting in a large number of incorrect predictions that want to be minimised. The second method, data value stride based prediction, works by looking at the stride between results at each time stamp; if this stride is constant, the next value can be predicted confidently. Deciding on the amount of data to store for reference in predictions is difficult, as the engine wants larger amounts of data to improve predictions but also wants to minimise overhead. To reduce the overhead of storing a larger number of values, one technique is to remove duplicate values. The final technique presented is two level prediction, which exploits the observation that a value usually takes four or fewer unique values; considering only these four values gives a 25% chance of accuracy by just randomly selecting from them. This approach is lightweight and fast, but can result in relatively low accuracy. Combinations of these approaches have then been evaluated, and the results show improvements in accuracy. (A sketch of the first two predictors appears at the end of this chapter.)

When to use Data Prediction
Calder et al. [46] describe two approaches to deciding when to use prediction techniques. A confidence measure for the prediction is required; how confident the system is in the prediction and the current scenario of the prediction both need to be considered. The impacts of incorrect predictions are weighed up as well before the predictions are executed fully to the end user. Two models are described for the confidence system: firstly a confidence saturating counter and secondly a confidence history counter.
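As an illustration of the first model, here is a generic saturating confidence counter in Java; this is a textbook construction in the spirit of [46], not code from that paper, and the parameter choices are assumptions.

```java
// Generic saturating confidence counter: confidence rises on each correct
// prediction and falls on each incorrect one, saturating at both ends.
public class SaturatingConfidenceCounter {
    private final int max;        // saturation ceiling, e.g. 3 for a 2-bit counter
    private final int threshold;  // minimum count at which we trust a prediction
    private int count = 0;

    public SaturatingConfidenceCounter(int max, int threshold) {
        this.max = max;
        this.threshold = threshold;
    }

    public void recordCorrect()   { if (count < max) count++; }
    public void recordIncorrect() { if (count > 0)  count--; }

    // Only act on a prediction once enough recent predictions were correct.
    public boolean isConfident()  { return count >= threshold; }

    public static void main(String[] args) {
        SaturatingConfidenceCounter c = new SaturatingConfidenceCounter(3, 2);
        c.recordCorrect();
        c.recordCorrect();
        System.out.println(c.isConfident()); // true after two correct predictions
    }
}
```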

Incorrect Data Prediction
Zhou, Huiyang et al. [47] describe two methods to recover from incorrect predictions: complete squashing and selective re-issuing. Complete squashing is where all instructions that are incorrectly predicted are cleared, re-gathered and processed. Selective re-issuing only clears the data and processes impacted by the incorrect predictions.
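To ground the last outcome and stride based techniques from the Data Prediction Techniques subsection above, here is a minimal Java sketch of both predictors; the interface and class names are illustrative, not taken from [43].

```java
// Illustrative sketch of the two simplest predictors from the literature:
// last outcome (replay the previous value) and stride based prediction.
interface ValuePredictor {
    void observe(double value);   // feed the predictor an actual observed value
    double predict();             // predict the next value in the stream
}

class LastOutcomePredictor implements ValuePredictor {
    private double last;

    public void observe(double value) { last = value; }

    public double predict() { return last; } // simply replay the last seen value
}

class StridePredictor implements ValuePredictor {
    private double last;
    private double stride; // difference between the two most recent values

    public void observe(double value) {
        stride = value - last;
        last = value;
    }

    public double predict() { return last + stride; } // extrapolate one stride ahead
}

public class PredictorDemo {
    public static void main(String[] args) {
        ValuePredictor stride = new StridePredictor();
        for (double v : new double[] {10.0, 12.0, 14.0}) {
            stride.observe(v);
        }
        System.out.println(stride.predict()); // prints 16.0 for a constant stride of 2
    }
}
```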

3 Research Problem: Transient Late Timing Faults

3.1 Real Time Stream Processing Issues

The focus of this project is on tolerating transient late timing faults caused by slow processing tasks in real time stream processing systems.

Transient Late Timing Faults
Late timing faults are a key issue in real time stream processing systems, as these systems place as much importance on meeting the real time requirements as on correctness. These systems are required to be low latency [10], enabling them to meet the real time requirements so that data can be processed as fast as possible, as it is most valuable on arrival [3]. Late timing faults can result in missing key decision making opportunities that are only available in short windows of time, reducing the value of the data. As stream processing systems are commonly designed as a directed acyclic graph, a single slow processing task or faulty node can have major impacts on the rest of the stream processing system because of data dependences [48]. In general distributed systems, a few slow tasks can cause a significant slow down of job execution, known as the Long Tail problem [49]. Transient late timing faults are even more difficult to tolerate, as they are only seen for short periods of time and commonly appear to occur randomly.

Big Data Scale
As big data is applied to real time stream processing systems, the system is required to grow to large compute clusters with many nodes, and this increases the chances of late timing faults forming in the system [7], [48]. As the complexity of the large distributed system grows, the chances of data imperfections (late, corrupt, out of order or missing data) increase [10]. Ideally, stream processing systems want to mask any faults to hide the latency they cause in the system and to reduce the impact on other parts of the system.

Slow Processing Tasks
A common cause of late timing faults in real time stream processing systems is slow processing tasks [3] (also known as stragglers [50]). The chances of slow processing tasks increase as the stream processing system grows to enable processing at big data scale. Slow processing tasks can be caused by human made non-malicious faults, where developers unintentionally develop software with faults [24]. Software is complex and therefore rarely free of any errors [51]. Well tested software can still contain faults, and these software faults can lead to transient faults that are unpredictable and can run for a long period of time before being recognised.

Software running for long periods of time commonly shows increased failure rates and degraded performance [52], as long running software increases the chances of software ageing [24], [52], [53]. In stream processing, the system commonly has to run for a long period of time to meet its high availability requirement. Software ageing does not cause an application to fail immediately; it takes time. Causes of software ageing include memory leaks or memory bloating, un-terminated threads, locking on data access or data corruption [24]. These issues can happen in the application itself, the external libraries that it uses or the operating environment it is running in. As these issues can occur in many different places and take a long time to reveal themselves, they are commonly transient faults. When a slow processing task happens at one of the nodes in the stream processing system, it too will have a major impact on the rest of the stream processing system.

Fault Tolerance Overheads
One of the key requirements of a real time stream processing system is to tolerate faults [10]. Fault tolerant methods in real time stream processing need to recover as quickly as possible to enable the system to meet its strict timing requirements [3]. However, fault tolerant methods can incur their own costs in timing and resource overheads, therefore any fault tolerant technique for a real time stream processing system needs to be able to scale and recover fast from faults so that it does not impact other parts of the distributed system [33].

3.2 Related Work: Fault Tolerance in Real Time Stream Processing

Replication and Upstream Backup
The two most common approaches to fault tolerance in stream processing systems are replication, where there are at least two copies of each node, and upstream backup, where upstream nodes store sent messages and replay them if a node fails downstream [26]. The issue with replication is that it does not scale resources efficiently, as at least 2x resources are required for each node or task, which becomes a bigger issue when working with big data. Replication does not deal with slow processing tasks, as replicas commonly use synchronisation between them, which will slow both processing nodes down. Upstream backup is slow to recover as it waits for a timeout, and it does not tolerate slow processing tasks: it will treat them as a failure, incurring the costs of replaying the data, which may lead to further issues because of duplication in the system.

Hwang et al. [33] describe three standard high availability approaches to fault tolerance. The main objective of this work is to evaluate the three approaches for distributed stream processing. Firstly, passive standby is a technique where tasks are replicated so two processes work on the same task, and the backup process takes over if the primary process fails.

This is successful because it is quick to recover, but it has the extra cost of running the processes twice. A second approach, named upstream backup, is to use upstream nodes as backups to the downstream nodes: if a downstream node fails, the upstream node that sent it the data has kept this data and can begin processing it, bypassing the downstream node. Fewer resources are consumed compared to the passive standby approach, but it is more complex to manage the removal of data in upstream nodes once it is no longer required. This approach also treats slow processing tasks as failures, because it will only detect them with timeouts. A final approach proposed is named active standby, where the primary process is replicated but both processes receive input from the upstream node, which stores the data it sends, combining passive standby and upstream backup. Each of these approaches can be extended using K-safety, where the backup nodes are replicated K times, enabling recovery from multiple node failures, although the cost grows linearly with the value of K.

Shah, Mehul A et al. [31] present Flux, a parallel approach to fault tolerance. In this approach a process runs simultaneously in parallel for a specified amount of time. If part of the process fails, it can be recovered by the parallel execution. The main issue with this approach is the overhead cost of running in parallel. The synchronisation between the two processes means that a slow processing task can slow both processes down [3].

Balazinska et al. [7] present DPC (Delay, Process and Correct), a replication protocol able to handle failures in processing nodes and network failures in a distributed stream processing system. It works by using replication to mask processing node and network failures. In response to failed nodes, upstream nodes are switched to their replicated or redundant neighbours. This approach guarantees eventual consistency and is flexible to the user, but it has the overheads of buffering tuples when failures happen and the cost of replication.

Koldehofe, Boris et al. [36] have proposed a method that allows the use of rollback recovery without needing persistent memory to store state, allowing recovery from multiple failures. The main objective of this paper is to attempt to overcome multiple hardware or system failures in distributed streaming architectures. The state is saved when its execution depends only on the state of incoming event streams, where the state has minimal non-reproducible state; these are called save points. This approach allows the processing to be recovered if multiple failures happen, although it still requires extra processing and storage along with increased complexity.

Matei Zaharia et al. [3] present discretized streams. The main objective of this work is to overcome slow nodes, which other techniques do not deal with. Discretized streams use micro batch processing, which simplifies synchronisation between nodes in the distributed system. Discretized streams use parallel recovery to recover special Resilient Distributed Dataset (RDD) partitions on nodes that keep state in memory and allow re-computation if failures happen.

When a node fails or is detected as a slow task, the task is split and distributed throughout the distributed system, increasing the speed at which it is processed and limiting the impact on nodes throughout the system. The results of this approach were successful, but it was applied to micro batch processing, not stream processing.

Data Prediction
A successful technique used in real time cloud based gaming to overcome late timing faults is speculative execution. Lee, Kyungmin et al. [12] present Outatime, a speculative execution engine whose main objective is to overcome network latency issues in cloud based mobile gaming. Outatime uses speculative execution successfully to mask 250ms network latencies for cloud based mobile gaming. The technique has been applied to navigation, impulse events and visual parts of the game. Outatime produces predicted future frames to the user so they appear to have no latency. Studies have shown that users prefer the predicted frames and interaction to lagging graphics and gameplay. These frames are predicted by considering recent input behaviour, state space subsampling and time shifting, incorrect prediction compensation, and bandwidth compression. The approach does incur the costs of rollback for incorrect speculation and the extra processing of speculative execution.

With any data prediction, the predictions formed may be incorrect, or a prediction may not be able to be formed because of a lack of training data [43]. Forming a prediction also takes time and increases resource overheads, as the data needs to be stored and then processed to form a model. Studies by Kai Wang et al. [43] on data prediction used to overcome data dependences have shown that combinations of prediction techniques and the use of different algorithms can improve performance.

Table 1 Related Work Stream Processing Fault Tolerance

[33] Hwang et al.
  Objective: Maintain high availability from processing node failures.
  Solution: Passive Standby. A replicated task takes over if the first fails.
  Successes: Short failure recovery time.
  Weaknesses: Overheads of running a parallel process.

[33] Hwang et al.
  Objective: Maintain high availability from processing node failures.
  Solution: Upstream Backup. Upstream nodes are used as backup to downstream nodes.
  Successes: If any downstream node fails, the upstream node can replay the data to any node to recover.
  Weaknesses: Complexity of notifying upstream nodes to remove stored data that is no longer required.

[33] Hwang et al.
  Objective: Maintain high availability from processing node failures.
  Solution: Active Standby. Each processing node has a dedicated secondary node that actively obtains data from upstream nodes.
  Successes: Fast recovery time.
  Weaknesses: Overheads of storage. Overheads of parallel processes.

[33] Hwang et al.
  Objective: Overcome multiple node failures.
  Solution: K-safety. Extends passive standby, upstream backup and active standby by having K backup nodes associated with the primary node.
  Successes: Recovery from multiple node failures.
  Weaknesses: High overheads that grow linearly with K.

[12] Lee, Kyungmin et al.
  Objective: Overcome latency in wide area networks for cloud based mobile gaming.
  Solution: Microsoft: Outatime. Speculative execution for navigation, impulse events and visuals.
  Successes: Hides latency. Experiments show positive results.
  Weaknesses: Requires rollback if incorrect. Overheads.

[31] Shah, Mehul A et al.
  Objective: Maintain high availability from processing node failures.
  Solution: Flux: Process Pairs. A replicated process runs in parallel.
  Successes: No lost data. No duplicate data.
  Weaknesses: Overheads of running a parallel process.

[7] Balazinska et al.
  Objective: Maintain high availability from processing node and network failures.
  Solution: DPC (Borealis). Replication; DPC is a protocol that handles failures in a distributed stream system by coordinating replication processes.
  Successes: Guarantees eventual consistency. Flexible.
  Weaknesses: Overhead of buffering tuples when failures happen to allow replays. Costs of replication.

[3] Matei Zaharia et al.
  Objective: Overcome slow nodes to maintain high availability.
  Solution: Discretized Streams. Parallel recovery to recover special RDD partitions on nodes, which can be recomputed if failures happen. Overcomes slow stragglers by running speculative backup copies of slow tasks.
  Successes: Overcomes issues of slow nodes.
  Weaknesses: Overheads of storing extra data to recover. Works only on micro batch processing.

[20] Brito et al.
  Objective: Maintain high availability.
  Solution: Logging Technique. Message logs are replicated and if a node fails the process can be regenerated from information stored in the log. Uses speculative execution to speed up parallel events.
  Successes: Improved speed, using speculative execution to reduce the impact of logging.
  Weaknesses: Overheads of storing and processing logs. Recovering lost processing.

[36] Koldehofe, Boris et al.
  Objective: Overcome multiple system or hardware failures.
  Solution: Rollback Recovery. State is saved only when execution depends on the state, to allow recovery.
  Successes: Low overheads compared to other approaches.
  Weaknesses: Extra overheads. Increased complexity.

3.3 Opportunities

The two major techniques for fault tolerance in real time stream processing, replication and upstream backup, do not tolerate slow processing tasks efficiently. The discretized streams approach has dealt with slow processing tasks but uses micro batch processing instead of stream processing, and it uses the benefits of batching to overcome slow processing tasks, therefore it cannot be applied to stream processing. Outatime has successfully overcome issues of latency using data prediction but is restricted to a single prediction algorithm. The success of Outatime using data prediction presents an opportunity to see if it can be used to overcome late timing faults caused specifically by slow processing tasks in real time stream processing systems.

To increase the flexibility of Outatime's approach, the proposed solution will have a repository of data prediction algorithms, from which the most accurate algorithm that can be trained while still meeting its time requirement is chosen, tolerating the late timing fault. (A sketch of such a selection step follows below.)
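As a rough illustration of this selection step, the following hypothetical Java sketch picks, from a repository, the algorithm with the best historical accuracy whose expected time fits within the remaining deadline; the names and structure are assumptions, not the thesis design.

```java
import java.util.List;

// Hypothetical sketch of selecting a prediction algorithm from a repository:
// prefer the historically most accurate algorithm that can still deliver a
// prediction within the time left before the late timing fault deadline.
public class AlgorithmSelector {

    public interface PredictionAlgorithm {
        String name();
        double historicalAccuracy();  // e.g. fraction of past predictions within variance
        long expectedTimeNanos();     // observed time to train and predict
        double predict(double[] recentValues);
    }

    public static PredictionAlgorithm select(List<PredictionAlgorithm> repository,
                                             long remainingBudgetNanos) {
        PredictionAlgorithm best = null;
        for (PredictionAlgorithm candidate : repository) {
            if (candidate.expectedTimeNanos() > remainingBudgetNanos) {
                continue; // cannot meet the timing requirement, skip
            }
            if (best == null || candidate.historicalAccuracy() > best.historicalAccuracy()) {
                best = candidate;
            }
        }
        return best; // null if no algorithm fits the time budget
    }
}
```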

4 Requirements

4.1 Finalised Requirements

- Develop a software solution to tolerate transient late timing faults in real time stream processing systems, caused by man-made non-malicious software bugs leading to slow processing tasks.
- Built as an extension to Apache Storm.
- Be able to scale over multiple machines.
- Have low overheads for resource consumption.
- The fault tolerant solution should not impact the timing of the system.
- Use data prediction to overcome transient late timing faults.
- A repository of data prediction techniques will be available and the best one chosen.
- The repository of data prediction techniques will be easily extendable by providing a Java interface to implement (a sketch of such an interface follows at the end of this chapter).
- Evaluate the developed solution using cloud data centre monitoring as a case study.
- Run experiments on the proposed solution, a baseline solution and a current fault tolerant technique to enable evaluation.

4.2 Assumptions

- Specific algorithms with high accuracy are out of scope, as they are domain dependent.
- Users of the proposed fault tolerant solution can provide their own algorithms with high data prediction accuracy.
- Predicting late timing faults is out of scope. The system will use a set timeout.

4.3 Fault Scenarios Tolerated

The solution will focus on tolerating transient late timing faults caused by man-made non-malicious faults, such as software bugs, that lead to slow processing tasks on a single stream processing node.
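As referenced in the requirements above, here is a hypothetical sketch of what such an extension interface could look like; the interface, class and method names are illustrative assumptions, not the actual library API.

```java
// Hypothetical extension point matching the requirement above: a user plugs in
// their own domain specific predictor by implementing one small Java interface.
public class PredictorExtensionSketch {

    public interface DataPredictor {
        void train(double[] recentValues); // recent stream history for the node
        double predictNext();              // called when a slow task is detected
    }

    // Example user supplied implementation: simple exponential smoothing.
    public static class ExponentialSmoothingPredictor implements DataPredictor {
        private final double alpha; // smoothing factor, tuned per domain
        private double estimate;

        public ExponentialSmoothingPredictor(double alpha) { this.alpha = alpha; }

        public void train(double[] recentValues) {
            for (double v : recentValues) {
                estimate = alpha * v + (1 - alpha) * estimate;
            }
        }

        public double predictNext() { return estimate; }
    }

    public static void main(String[] args) {
        DataPredictor predictor = new ExponentialSmoothingPredictor(0.5);
        predictor.train(new double[] {40.0, 42.0, 41.0});
        System.out.println(predictor.predictNext()); // smoothed estimate of the next value
    }
}
```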

5 Initial Investigation

5.1 Initial Experiments

To consider the effectiveness of data prediction techniques, an initial investigation into simple data prediction techniques was performed. Five prediction algorithms were developed: Last Outcome, Stride Based, Long Stride Based, Simple Markov and KNN. The prediction algorithms were evaluated to see their accuracy, the time it took to form the models for a large number of predictions, and the impact of the amount of training data made available.

5.2 Data

The experiments were applied to two types of publicly available data: first, the Google trace log dataset from actual Google compute clusters, containing data such as CPU and memory usage [13]; and second, weather sensor data from the University of Edinburgh, containing attributes such as temperature and air pressure [54]. See Appendix A 13.2 and 13.3 for examples of the data used. These two datasets were chosen because they are from very different domains, to see if prediction could be applicable across domains. To increase the realism of the Google data, the CPU and memory usage for every 5 minutes was extrapolated by taking averages for each minute between the 5 minute gaps.

5.3 Performance Metrics

To evaluate the success of the data prediction, accuracy and numerical data prediction metrics were used. Accuracy can be used to measure how successful a prediction is, but it is commonly used for classification tasks where data is predicted as a type [55]. As the data used was mostly non integer, the accuracy was allowed a variance of less than 1. Methods have been designed specifically for evaluating numeric prediction. Mean squared error and root mean squared error use the same formula, but root mean squared error is square rooted to bring its value into the same dimensions as the data. These two metrics are commonly used but can exaggerate the effects of large outliers. Mean absolute error just averages the magnitudes of the individual errors without considering the sign. This metric limits the impact of large outliers compared to mean squared error. Whichever metric is chosen, it is commonly found that the best prediction technique will be revealed by any of the numerical metrics described [55]. (The formulas are given below.)
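For reference, with predicted values p_i, actual values a_i and n predictions, the metrics described above have the standard definitions:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(p_i - a_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert p_i - a_i\rvert$$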

5.4 Results

Accuracy: Weather Data

Shown in Figure 3 and Figure 4 are the accuracy values for the predictions of the weather and Google trace log data. The accuracy was allowed a variance of less than 1 between the actual and the predicted value, to take into account the non-integer values. The results clearly show that prediction can be a powerful method, as accuracy was high for both domains, though not for all attributes. The results also show that different data prediction algorithms perform differently: certain attributes are predicted extremely well and others poorly. As prediction is domain specific and a selection of algorithms can be used with the proposed solution, it is up to the user to develop specific algorithms for their domains. The design of the proposed solution is granular, meaning different algorithms can be used for different processing nodes if required, to improve performance and reduce overheads.

Figure 3 Weather Data Accuracy Results

Accuracy: Google Trace Log Data

Figure 4 Google Trace Log Data Accuracy Results

Timings and Training Data Size

To investigate the size of data required and the time it takes to build a prediction, the timings of the data predictions were taken. It soon became clear that, with this type of data, having more data does not result in more accurate predictions; this was expected from background research [43]. Having strict timing requirements means that predictions need to be produced as quickly as possible, and this needs considering when selecting algorithms, which was integrated into the design.

Table 2 Google Trace Log KNN Size Results

Data        Size of K   Algorithm   Time (nano)   Accuracy (1% Variance)
CPU Usage   4           KNN
CPU Usage   50          KNN
CPU Usage   100         KNN
CPU Usage   1000        KNN
Mem Usage   4           KNN
Mem Usage   50          KNN
Mem Usage   100         KNN
Mem Usage   1000        KNN

5.5 Initial Experiments Conclusion

The initial investigation into using data prediction across different domains shows that different algorithms perform differently on different types of data. For these two domains, replaying the last value is a successful technique: it is accurate and is also the quickest and most memory efficient. Increasing the size of the look-back data for KNN actually decreases the accuracy, supporting the research finding that prediction only requires the last four seen values for successful data prediction [45].
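To illustrate how lightweight these techniques are, the sketch below shows a Last Outcome predictor behind a simple predictor interface of the kind the requirements call for; the interface and method names are illustrative assumptions, not the project's exact API.

    // Illustrative sketch only: names are assumptions, not the project's actual API.
    public interface DataPredictor {
        void train(double observedValue);   // called as each new value is seen
        double predict();                   // returns the next predicted value
    }

    // Last Outcome simply replays the most recently seen value.
    public class LastOutcomePredictor implements DataPredictor {
        private double lastValue;

        @Override
        public void train(double observedValue) {
            this.lastValue = observedValue;
        }

        @Override
        public double predict() {
            return lastValue;
        }
    }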

6 Design

6.1 Initial Research

To develop a fault tolerance solution for big data stream processing architectures, common big data stream processing systems were first researched and understood. The Lambda Architecture [56] is a popular architecture for processing big data and contains a real time component. This architecture works by having data streams, usually using Apache Kafka, that feed into Apache Storm, where the stream data is processed before being visualised, persisted or processed further. As the fault tolerant solution was being developed as an extension to Apache Storm, it would work closely with Storm bolts. As Storm can already distribute bolts throughout a cluster, the proposed fault tolerant solution can harness the power of Storm to distribute its monitoring capabilities.

To initially design the fault tolerance solution, existing design patterns were considered, including the observer design pattern [57]. This pattern is designed to monitor an object so that when its state changes all of its monitors are updated. Further work on this design pattern and its application to distributed systems has led to the publish-subscribe pattern [58]. This pattern works by having a designated channel for state updates on the observed object, named the subject; subscriber objects subscribe to this channel for updates, enabling as many subscribers as required. Put simply, the subject publishes state updates to the channel and the subscribers read the state updates off the channel, allowing them to monitor the subject object.

Distributed monitoring systems are required to be scalable, robust and have minimal overheads [59]. Ganglia [59] is a distributed monitoring system designed following a hierarchical pattern, where each node in a cluster has a Ganglia component monitoring it and these report up to a single point of monitoring, the client. A publish/subscribe pattern is used to send the messages from each monitoring node up the hierarchy. Ganglia uses heartbeats to detect whether nodes are available for monitoring in the distributed system.
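As a minimal illustration of the publish-subscribe idea described above (a sketch with assumed class names, not code from the solution):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // The subject publishes state updates to all registered subscribers.
    public class Subject {
        private final List<Subscriber> subscribers = new CopyOnWriteArrayList<>();

        public void subscribe(Subscriber subscriber) {
            subscribers.add(subscriber);
        }

        public void publish(String stateUpdate) {
            for (Subscriber subscriber : subscribers) {
                subscriber.onUpdate(stateUpdate);
            }
        }
    }

    interface Subscriber {
        void onUpdate(String stateUpdate);
    }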

6.2 System Model

Figure 5 System Model

The system model (see Figure 5) shows a traditional stream processing system with data flowing from left to right. Nodes 1A, 2A and N each perform an individual task on the data in a serial pattern of data flow; in certain cases these nodes can be parallelised. The fault tolerance enhancements are shown in blue. Each processing node has its own assigned fault tolerant agent that monitors the node for late timing faults, builds the prediction models and, if a late timing fault occurs, chooses the most accurate prediction data and replays it. This design follows a decentralised approach to minimise latency when recording data, monitoring for late timing faults and replaying data, as these nodes are expected to perform in milliseconds. Giving each node its own fault tolerance removes a single point of failure for the whole cluster, as only a single node will be affected. The agents do have a single manager they report to, but will keep running even if it fails. The manager checks the state of the agents using a heartbeat design. This central agent manager enables a complete view of the current state of the system and can also inform each node whether it should have received data from its previous node, enabling it to overcome latency or upstream node issues using data prediction; this, however, is out of scope and left as future work.

6.3 Late Timing Fault Activity

Figure 6 Example Fault Model

To understand how the system deals with the slow processing task use case, an activity diagram was designed. This use case is for a single processing node with a transient late timing fault caused by a slow processing task. For this use case the node has already seen data and trained the model. Evaluation has also taken place and ranked each data prediction algorithm in the system; the model building and evaluation run continuously in the background as new data is seen, in a hot standby design.

Step 1) Stream data enters stream processing Node X, which is attempting to perform a task, but performance has degraded because of software ageing: the software suffers from non-terminated threads that are causing thread locks.

Step 2) The fault tolerance agent knows that data has entered the node but has not been processed fully and exited within an acceptable time frame.

Step 3) The fault tolerance agent raises an alert that a slow processing task is in progress and therefore retrieves the predicted data for that data stream with the highest evaluation score; in this use case, the KNN predictor with a score of 83%.

Step 4) The predicted data is returned and sent on by the processing node, which clears it for the next data tuple, tolerating the transient late timing fault.

6.4 Fault Tolerance Agent Technical Details

Figure 7 shows the internal components of each fault tolerant agent; the design of these components is now described.

Figure 7 Fault Tolerant Agent Details

6.4.1 Timing Fault Monitor

Data is recorded as it enters and exits the processing node being monitored. Once data enters the system, a timer watches to make sure it is processed within an acceptable time, which is defined by the user as it is task specific. If the data exits the processing node before the fault tolerance timer times out, the system continues normal operation. If the timer does time out, the data prediction is called for that data stream and then replayed. It is key that the time taken for monitoring to record entry and exit data is as minimal as possible.

6.4.2 Fault Tolerance Controller

The fault tolerance controller coordinates the prediction and evaluation components, provides the new data to be modelled and retrieves the highest scoring prediction when required.
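A minimal sketch of the timing fault monitor just described, assuming illustrative class and method names; in the project the watching thread runs continuously rather than being scheduled per tuple, but the armed/disarmed timeout behaviour is the same idea.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    // Watches a single processing node: if data does not exit within
    // timeoutMillis of entering, a late timing fault is raised.
    public class TimingFaultMonitor {
        private final long timeoutMillis;   // user defined, task specific
        private final ScheduledExecutorService watcher =
                Executors.newSingleThreadScheduledExecutor();
        private volatile ScheduledFuture<?> pendingAlert;

        public TimingFaultMonitor(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        public void recordEntry(String streamId) {
            // Arm the timeout; if it fires before recordExit, predicted data is replayed.
            pendingAlert = watcher.schedule(() -> onLateTimingFault(streamId),
                                            timeoutMillis, TimeUnit.MILLISECONDS);
        }

        public void recordExit() {
            ScheduledFuture<?> alert = pendingAlert;
            if (alert != null) {
                alert.cancel(false);   // normal operation: disarm the timer
            }
        }

        private void onLateTimingFault(String streamId) {
            // In the real solution: fetch the highest ranked prediction for this
            // stream and replay it so the node is clear for the next tuple.
        }
    }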

6.4.3 Prediction

Data prediction has been designed to be performed asynchronously. The system can afford to keep building models because lightweight prediction algorithms are used. A limited amount of data is stored; for the cloud data centre monitoring scenario this was set to 50 values, after the initial investigation revealed that storing more than this had limited improvement on prediction accuracy and in some cases made it worse. The prediction follows a hot standby design, where once a timeout occurs the agent can immediately get a prediction for a data stream, reducing the latency of the fault tolerant solution.

6.4.4 Evaluation

Evaluation has been designed to be performed asynchronously. As new data enters the system, evaluation takes place for each algorithm to keep the prediction as accurate as possible; the shortened look-back of data makes this feasible. The evaluation component uses error rate (the inverse of accuracy), with a variance of 1.4% allowed for this specific scenario because it works with non-integer values. The numerical metrics of mean absolute error, root mean squared error and mean squared error are all used; since these are error measures, the score wants to be minimised, so the highest ranked prediction technique is the one with the lowest score.

6.5 Key Design Decisions

6.5.1 Decentralised Design

A decentralised design was eventually used, after attempting a centralised design. The issue with having a single point of monitoring and fault tolerance is that the memory store is a single point of failure and a bottleneck, where a large number of different threads attempt to write to it, slowing down the entire system; this design does not scale. Updates are costly in a data store and data is moving in milliseconds, therefore repeatedly updating the single data store became infeasible. A decentralised design was used that created a data store for each node, reducing the complexity of the data store and removing the risk of thread locking with potentially thousands of threads writing, reading and updating it. The decentralised design is much easier to manage and reduces network traffic. Having an individual agent for each processing node provides a more flexible, granular and fault tolerant solution, as there is no central point of failure or central database slowing it down. Each node is isolated, so it can run reliably even if the agent manager or other agents fail.

6.5.2 Hot Standby

The most important property of a real time system is meeting its real time requirements, therefore data prediction needs to be as fast as possible if used as a method of tolerating late timing faults. Lightweight algorithms are low cost, and the look-back of data required to train them is minimal, since once they look back beyond a certain point across the training data they become less effective; building them in the background is therefore cheap, as the data kept is minimal. To keep the impact on timing low, a hot standby approach has been designed, meaning prediction data is always available as long as a data stream's data has been seen before, even if just once. Future work could involve more complex algorithms that run in the agent manager, reducing the impact of large algorithms on the system; also, when processing nodes are parallelised, each data stream and its resulting output will be the same across each node.

6.6 Design Conclusion

The design of the system changed as the project progressed and prototypes were developed; the major change was from a centralised approach to a decentralised approach. Once the decentralised approach had been identified as the best design, producing the solution in code was a lot simpler and more manageable. A large amount of time was spent on the design, but small components were agreed early on, allowing the system to be built in increments, and the well thought out design from an early stage meant code could be reused from earlier prototypes at each iteration.

7 Implementation

7.1 Methodology

The fault tolerance solution was built using an incremental approach, where small increments were developed as further research was completed, which led to new requirements, design and eventually development. A focus of the solution was to maintain a large suite of unit tests; these acted as regression tests when extending the solution with new features. At one point in the project a whole prototype was thrown away when choosing to switch from a centralised to a decentralised design, but certain components were taken and reused, as the system was well designed to be reusable.

7.2 Case Study: Cloud Monitoring Storm Topology

A cloud monitoring Storm topology was developed that was enhanced with the prediction fault tolerance. This topology was formed of 18 data spouts, which fed in Google trace log data from three machines multiplied by 6. These streams feed into the EnableTestDataSplitForFaultBolts bolt, which was required to split the data between faulty nodes and non-faulty nodes. Once data passed through this bolt it went to either a SingleMachineAverageBolt or a SingleMachineAverageBoltWithFaultInjection; these bolts simply averaged the usage for each machine identifier and output the results to a single bolt, the AllMachinesAverageUsageBolt class, which then averaged across all machines.

7.3 Storm Cluster

The Storm topology can be run locally but is designed to scale over multiple machines. A Storm cluster was set up on the University of Leeds cloud test bed. The topology was installed on multiple virtual machines running Ubuntu. The experiments were run on two and five virtual machines, but the system was tested on up to ten virtual machines. The Storm user interface was run to confirm that the cluster was running the topology and that the relevant virtual machines were running as part of the cluster.
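To make the wiring concrete, the sketch below shows one way the case-study topology might be built and submitted, using the Storm 0.9-era backtype.storm API. The spout class name (GoogleTraceLogSpout), component identifiers, parallelism hints and grouping choices are illustrative assumptions; the bolt class names are those given above, and in the real topology the split bolt's routing between faulty and non-faulty bolts may use separate streams rather than the simple field groupings shown here.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class CloudMonitoringTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // 18 spout instances emitting Google trace log data (spout class assumed).
            builder.setSpout("trace-spout", new GoogleTraceLogSpout(), 18);

            // Splits the streams between faulty and non-faulty averaging bolts.
            builder.setBolt("split", new EnableTestDataSplitForFaultBolts())
                   .shuffleGrouping("trace-spout");

            // Per-machine averaging; fields grouping keeps one machine's data together.
            builder.setBolt("machine-avg", new SingleMachineAverageBolt(), 5)
                   .fieldsGrouping("split", new Fields("machineId"));
            builder.setBolt("machine-avg-faulty",
                            new SingleMachineAverageBoltWithFaultInjection(), 4)
                   .fieldsGrouping("split", new Fields("machineId"));

            // Final bolt averages the usage across all machines.
            builder.setBolt("all-machines-avg", new AllMachinesAverageUsageBolt())
                   .globalGrouping("machine-avg")
                   .globalGrouping("machine-avg-faulty");

            StormSubmitter.submitTopology("cloud-monitoring", new Config(),
                                          builder.createTopology());
        }
    }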

7.4 Java Library Design

The fault tolerance solution was built as an accessible library with a single object defined, where only entry and exit methods are required to be invoked. To send predicted data, the monitored node implements an interface (Java allows multiple interface implementations) with a single method required for replaying the data. The stream processing node developers are required to implement this method, but this results in a library that can be used with almost all Java applications.

7.5 Technical Details

The fault tolerance solution has a single object of the PredictionFaultTolerance class that is defined in each stream processing node. Two methods are used that record entry and exit data. The stream processing node implements an interface that has a single method required for sending on predicted data. This enables the fault tolerant solution to work with any object in the Java language; it is therefore not limited to Apache Storm. When the object is defined for the fault tolerance solution, the monitoring, prediction and evaluation threads are all created and started. The details of these are now discussed.

7.5.1 Monitoring Component

The solution places a monitoring object into a stream processing node that reads the data input and the data output of the node. It was very important that this monitoring and entering of data into the fault tolerance solution happened as quickly as possible, so apart from the entering and exiting data being recorded along with a timestamp, all other tasks of the fault tolerance solution are asynchronous. When data is recorded entering the stream processing node, a new timeout alert time is set on a watching thread that simply watches for timeouts. This thread runs continuously, as the time cost of starting a new thread is too expensive. When the data exits the stream processing node, the watch time is set to null and ignored. The process repeats when data next enters the system: the watch thread has a new timeout alert time set.

7.5.2 Prediction Component

Once the fault tolerance object is defined, the data prediction models start to be built in the background of the system. When data enters the system, a prediction model is built for each data prediction technique. This means that predictions can instantly become available, enabling the system to meet its timing requirements.

7.5.3 Evaluation Component

Once predicted values are created and the next value exits the processing node in the stream processing system, it is compared to the predicted value, and the error rate, mean squared error, root mean squared error and mean absolute error are calculated and totalled up. The score wants to be minimised; prediction techniques are then ranked by their evaluation score for each data stream that passes through the system, so if a data stream prediction is required the best prediction can be selected.
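Bringing Sections 7.4 and 7.5 together, the sketch below shows how a Storm bolt might use the library. The PredictionFaultTolerance class and the MonitorCorrectionReplay interface are named in the text; the method signatures (recordEntry, recordExit, the constructor, and replayPredictedData taking a Values object) are assumptions for illustration.

    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class SingleMachineAverageBolt extends BaseRichBolt
            implements MonitorCorrectionReplay {

        private transient PredictionFaultTolerance faultTolerance;
        private transient OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // Defining the object creates and starts the monitoring,
            // prediction and evaluation threads for this node.
            this.faultTolerance = new PredictionFaultTolerance(this);
        }

        @Override
        public void execute(Tuple tuple) {
            faultTolerance.recordEntry(tuple);   // data enters: timeout armed
            // The node's actual task: averaging usage per machine (logic omitted).
            double avgCpu = tuple.getDoubleByField("cpu");
            collector.emit(new Values(tuple.getStringByField("machineId"), avgCpu));
            faultTolerance.recordExit(tuple);    // data exits: timeout disarmed
        }

        // Single method required by the library for replaying predicted data.
        @Override
        public void replayPredictedData(Values predictedData) {
            collector.emit(predictedData);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("machineId", "avgCpu"));
        }
    }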

7.5.4 Slow Processing Fault Detected Scenario

If a slow processing task occurs, the monitoring delay watcher thread will time out. As the prediction models have already been built and the evaluation ranking will have taken place, the system can instantly get a predicted value. The stream processing node's implemented method, replayPredictedData from the fault tolerance solution's MonitorCorrectionReplay interface, is called to send the predicted data; the data is sent and the node is now clear for new data, tolerating the slow processing task.

7.6 Technical Challenges

7.6.1 Multiple Threads

As the proposed fault tolerant system was required to work quickly and without blocking the Storm processing node, a large number of Java threads were used. The common producer/consumer design pattern was used with the Java concurrent collections, in particular the priority blocking queue. The priority blocking queue allows data to be put onto it and taken from it whenever it has data available. This enables multiple threads to put data on the queue and multiple threads to take data off the queue without concurrency issues, increasing the efficiency of the application, as a larger number of threads can be used, improving performance. The watching threads that take data off the queue also watch the queue more efficiently, as the blocking take notification has been implemented by Oracle, rather than looping indefinitely and polling the queue for data. To abstract away the management of threads, Spring's task executor was used, which manages the running and shutting down of threads once complete.
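A minimal sketch of this producer/consumer arrangement (the element type is an assumption; PriorityBlockingQueue requires its elements to be comparable, here ordered oldest first):

    import java.util.concurrent.PriorityBlockingQueue;

    public class RecordedDataQueue {
        // Thread-safe queue: monitoring threads put recorded data on, and
        // prediction/evaluation threads take it off, with no explicit locking.
        private final PriorityBlockingQueue<RecordedData> queue =
                new PriorityBlockingQueue<>();

        public void produce(RecordedData data) {
            queue.put(data);        // never blocks (the queue is unbounded)
        }

        public RecordedData consume() throws InterruptedException {
            return queue.take();    // blocks until data is available,
        }                           // no busy-wait polling loop needed
    }

    class RecordedData implements Comparable<RecordedData> {
        final long timestamp;
        final double value;

        RecordedData(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public int compareTo(RecordedData other) {
            return Long.compare(this.timestamp, other.timestamp); // oldest first
        }
    }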

7.7 Programming Standards & Code Quality

7.7.1 General Design and Implementation

When developing the code, a focus was on following key principles of software development: having good coding style through comments and readable variable and method names, programming to an interface over an implementation to make the code more flexible, and using key Java technologies such as Maven and Spring to enhance the application and conform to industry standards.

7.7.2 Project Management Tool: Maven

When developing software, following convention is important. Maven [60] was used, which provides a set of standards, a project lifecycle and a dependency management system. Java developers will be familiar with Maven and the project file structure, therefore they can quickly find the parts of the code they want, such as the main source code or the unit tests. Maven makes it easy to build and distribute software by running simple Maven commands, which run all unit tests, build an executable jar file and deploy it to a repository, including all Java library dependencies required.

7.7.3 Dependency Injection: Spring Framework

When developing object oriented code, multiple objects interact with each other, forming dependencies. The traditional design was for each object to define or obtain its own references to the objects it works with. This has the potential to lead to tightly coupled code that is not flexible and is also difficult to test. Spring [61] provides a framework that allows dependency injection of objects, where objects are provided their dependencies at creation time by an external entity, in this case a Spring bean XML file; dependencies are therefore injected into objects [62]. Using dependency injection improves code quality by producing loosely coupled objects that have a single responsibility, follow the open/closed principle of object oriented programming (OOP) [63] and are isolated, resulting in code that is easier to test.

7.7.4 Software Testing

Software testing is key for any software project, as the reliability of the code needs to be checked. Throughout the incremental development approach changes were made, so having a suite of tests acted as regression tests to make sure changes did not break the existing code. Using Maven as a build tool provided the management of dependencies for JUnit, Mockito [64] and Cobertura [65], which were used to provide tests. Following good programming practices of programming to the interface rather than the implementation, using dependency injection and having the Maven build tool to run tests resulted in higher quality and easier to write unit tests.

The unit tests were written using JUnit and run using Maven. As OOP uses multiple objects, dependencies are formed. Unit tests should be written to test a single class, not all of the classes it collaborates with [66]. Therefore, in appropriate cases, mock objects were used to reduce the complexity of tests and to speed them up. Mockito [64] provides a Java framework where objects can easily be mocked and method calls can have mocked results returned. This results in testing a single class that is isolated from the rest of the code, allowing dependent classes to change without affecting the unit test. Programming to the interface also improved the use of Mockito, as interfaces were mocked rather than implementations, reducing tests tightly coupled to the code. To evaluate the quality of the tests, the code coverage tool Cobertura [65] was used. Cobertura measures the line coverage and the branch coverage of the code; see Figure 8.
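An illustrative test in the style described, mocking the predictor interface rather than an implementation; the class and method names here are hypothetical, not the project's actual test code:

    import static org.junit.Assert.assertEquals;
    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.verify;
    import static org.mockito.Mockito.when;
    import org.junit.Test;

    public class FaultToleranceControllerTest {

        @Test
        public void returnsPredictionFromHighestScoringPredictor() {
            // Mock the interface, isolating the controller from its collaborators.
            DataPredictor predictor = mock(DataPredictor.class);
            when(predictor.predict()).thenReturn(42.0);

            FaultToleranceController controller = new FaultToleranceController(predictor);

            assertEquals(42.0, controller.predictionForStream("cpu-usage"), 0.001);
            verify(predictor).predict();   // the controller delegated to the predictor
        }
    }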

Figure 8 Cobertura Code Coverage Results

7.7.5 Code Quality Tools

To catch anything missed by using the industry standards above, FindBugs [66] was run on the code; this is a static analyser that reveals common mistakes Java developers make. As the code was developed to a high standard, it found only one warning message, which was acceptable for the design of this code.

7.8 Experimentation Code

To record the experiment results, code was developed that logged the results for single nodes and end to end. Fault injection code was also developed that injected simulated transient slow processing tasks into the Storm system. As a large amount of the results were recorded to comma separated value files, helper Java applications were developed that extracted and joined the information. For example, the timestamps of data entering were matched to the data exiting by a unique message identification number, and the average time to complete was calculated, as sketched below.
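A sketch of such a helper, joining entry and exit logs by message ID to compute the average completion time; the file names and the messageId,timestamp line format are assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class TimingJoiner {
        public static void main(String[] args) throws IOException {
            Map<String, Long> entryTimes = readLog("entry.csv");
            Map<String, Long> exitTimes = readLog("exit.csv");

            long total = 0;
            int matched = 0;
            for (Map.Entry<String, Long> entry : entryTimes.entrySet()) {
                Long exit = exitTimes.get(entry.getKey());
                if (exit != null) {                 // only tuples that completed
                    total += exit - entry.getValue();
                    matched++;
                }
            }
            if (matched > 0) {
                System.out.println("Average completion time (ms): "
                                   + (double) total / matched);
            }
        }

        // Reads lines of the form messageId,timestamp into a map.
        private static Map<String, Long> readLog(String file) throws IOException {
            Map<String, Long> times = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split(",");
                    times.put(parts[0], Long.parseLong(parts[1]));
                }
            }
            return times;
        }
    }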

8 Experimentation Plan

8.1 Summary

To evaluate the success of using data prediction to overcome transient late timing faults, a number of experiments were performed. The experiments were run on an example real time stream processing system developed with Storm. The system used the Google trace log dataset [13] to monitor, calculate and record the average CPU and memory usage of different servers in a Google cluster. Different parameters were set for the experiments, such as different size data inputs, the number of faults and the number of virtual machines the stream processing system was run on. The size of inputs was changed by increasing and decreasing the speed at which data is emitted into the Storm system. The number of faults was changed by having more bolts with faults injected, starting with 10% for small faults and rising to 45% for large faults. The experiments were run on three solutions: a baseline solution with no fault tolerance, Storm's upstream backup fault tolerance method, and the data prediction method.

8.2 Metrics

Five key metrics were investigated and recorded for the experiments. Table 3 shows these metrics, what they recorded and how they were recorded. The "Node with Faults" and "End to End" entries show whether each metric was recorded at the nodes with faults injected and/or from the start to the end of the entire stream processing system. Overheads and prediction accuracy were only recorded at the nodes with faults, as they are not relevant without the fault tolerant agent; the overheads measure the impact of the fault tolerant agents, so these metrics were only recorded on the virtual machines with fault tolerant agents running.

Table 3 Experiment Metrics

Throughput
  Info: measures the total number of data points processed each second.
  Node with Faults: Yes. End to End: Yes.
  Method: Java code logs the throughput each second to a file.

Timings
  Info: measures the average time for data to be processed, each second.
  Node with Faults: Yes. End to End: Yes.
  Method: Java code logs the recorded entry time per second; the output time is also recorded. Each data point has a unique ID, which is then referenced and the time length calculated.

Overheads
  Info: metrics for CPU, memory and network usage over the time period of each experiment.
  Node with Faults: Yes. End to End: No.
  Method: the usage statistics from top were logged to a file every 5 seconds; vnStat was used to record network download and upload.

Scale
  Info: the experiments were run over different numbers of virtual machines to test the system at scale; different data input sizes were also used.
  Node with Faults: Yes. End to End: Yes.
  Method: University of Leeds cloud test bed with Ubuntu virtual machines, managed by OpenNebula.

Prediction Accuracy
  Info: accuracy and mean absolute error were recorded for each prediction in each recorded experiment. Mean absolute error was the only numerical metric used, as research has shown that any of the numerical metrics will usually reveal the best method [55]. As the data is non-integer, accuracy was allowed a variance of 1%.
  Node with Faults: Yes. End to End: No.
  Method: the prediction technique, actual values and predicted values were logged, and the prediction metrics were calculated offline so recording did not impact performance.

8.3 Experiments Plan

An experiments plan, with a table of the different parameter settings, was agreed before beginning the experiments.

Fixed Parameters:
- Length of each experiment: 3 minutes.
- Experiments run on the University of Leeds cloud test bed using Ubuntu virtual machines.

- Fixed number of data inputs: 18.

Parameters:
- Solution: baseline with no fault tolerance, Storm's upstream backup method, or the developed prediction solution.
- Small faults: faults are injected into 10% of the averaging part of the system (10 workers).
- Large faults: faults are injected into 45% of the averaging part of the system (10 workers).
- Large data input: each of the 18 spouts emits data into the system every 250 milliseconds.
- Medium data input: each of the 18 spouts emits data into the system every 500 milliseconds.
- Small data input: each of the 18 spouts emits data into the system every 1000 milliseconds.
- 2 virtual machines or 5 virtual machines. Storm requires a head node that simply supervises the cluster; virtual machines that actually complete tasks are referred to as worker virtual machines.

8.4 Experiment Assumptions

- The system detects failures with a specified timeout, set for this example.
- Transient failures are tolerated; the faults themselves are not corrected.
- The timeout period for upstream backup is the default of 30 seconds.
- Data values are assumed to be independent of each other; already-predicted data is not reused for prediction.
- Sufficient data has already been seen to train the prediction, by letting the system run for 10 seconds.
- The number of processing nodes is fixed at 44, of which 10 calculate averages.
- The number of data inputs is fixed at 18.
- Failure isolation between VMs.
- Failure type: memory leaks, memory bloats or un-terminated threads causing locks.
- The slow processing task is assumed to still produce its value after the delay.
- Failures are injected every 30 seconds, lasting 8 seconds each, by sleeping threads to reduce throughput.
- Fixed time threshold for the timeout.
- Input data is not faulty; there is no missing data.

8.5 Fault Injection Simulator

To enable fault injection and logging of the results, a Java library was written that was referenced in the stream processing systems. This library contained the fault injecting code that simulated slow processing tasks. To use this code, the library was imported into the project, an ExperimentsSystem object was defined, and a fault injection "consider" method was called in the code, which decided whether a fault should be injected or not. The slow processing task was simulated simply by sleeping the thread for 4 seconds and then releasing it. The faults were injected every 35 seconds for a period of 8 seconds. Sleeping the thread simulates a slow processing task that could be caused by un-terminated threads, memory leaks, memory bloats or thread locks. To record the results, logging objects were defined as part of the ExperimentsSystem that wrote the results to files. Issues were found where recording the prediction accuracy results to file was too slow and impacted system performance; as baseline and upstream backup did not require this result, Java logging was used instead, though this still slightly slows the performance of the prediction solution.
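A sketch of the fault injection logic described above; the injection schedule and sleep length are given in the text, but the considerFaultInjection method name and timing fields are assumptions:

    public class ExperimentsSystem {
        private static final long INJECTION_PERIOD_MS = 35_000; // inject every 35 seconds
        private static final long FAULT_WINDOW_MS = 8_000;      // each fault lasts 8 seconds
        private static final long SLEEP_MS = 4_000;             // simulated slow task

        private final long startTime = System.currentTimeMillis();

        // Called by a faulty bolt for each tuple to decide whether to inject a fault.
        public void considerFaultInjection() {
            long positionInCycle =
                    (System.currentTimeMillis() - startTime) % INJECTION_PERIOD_MS;
            if (positionInCycle < FAULT_WINDOW_MS) {
                try {
                    // Sleeping simulates a slow processing task caused by un-terminated
                    // threads, memory leaks, memory bloats or thread locks.
                    Thread.sleep(SLEEP_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }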

9 Experiments Results

9.1 Experiment Results Summary

The results have revealed a strong performance for using data prediction to tolerate late timing faults. The timing and throughput results in particular have highlighted that data prediction can hide the impacts of slow processing tasks in real time stream processing systems. Data prediction performs strongly against both no fault tolerance and upstream backup, where both are blocked for the entire simulated transient fault. An unexpected result is that Storm buffers the data and can then quickly push it through, producing large spikes of throughput after delays when not using the prediction fault tolerant solution. When increasing the scale of the data, the prediction did not perform as strongly, because the fault detection timeout, set when working with small data, was too slow; this limitation was expected, and future work will look to improve it. It can easily be resolved by tuning the parameters of each node or by implementing a learning system for the average time it takes to complete a single data tuple. Network usage is key in large compute clusters, and data prediction performs strongly against upstream backup. The CPU usage is higher, but the data centre's and Storm's scheduling of tasks can limit this impact. The data prediction algorithms used were generic and simple, yet they still performed better than expected, with an average accuracy of 63% across the experiments. This value should improve greatly with specific algorithms designed for a specific domain using the proposed solution. A selection of results showing interesting points is analysed in this report; fuller results can be found in Appendix A.

9.2 Timing and Throughput Results

Small Faults

Single Worker Virtual Machine, Small Data

For the experiments performed on a single worker virtual machine with small data, the expected pattern was seen. The results shown in Figure 9 and Figure 10 are for a single worker node with a fault, and Figure 12 and Figure 13 show the end to end results of the entire stream processing system. The data throughput can be seen to drop at each of the transient fault injections, at around 25, 65 and 105 seconds along the x axis. The prediction technique developed was not impacted by the transient fault, but the baseline and upstream

backup approaches were severely impacted and throughput dropped to 0. Interestingly, once the slow processing node's simulated fault had cleared, Storm could push through a lot more data in a shorter period of time, as shown by the spikes in the graphs. This is most likely because the data is buffered by the affected node, which can then process it very quickly. In reality, more complex tasks would see a longer lasting effect, because the processing node may take much longer to complete each tuple rather than the few milliseconds in this example case. Viewing the time it takes for the processing node to complete a task where the fault is injected, upstream backup performs as poorly as the baseline solution: the slow processing task is not even noticed as a fault and the system waits, giving a large spike in task completion time of 4000 milliseconds, the entire length of the delay. The deviation plot reveals the changes from the average. The average time taken for the data when using the prediction solution shows only minimal increases.

Figure 9 Small Data Small Faults Data Throughput Single Node

Figure 10 Small Data Small Faults Single Node Average Time

Figure 11 Small Data Small Faults Single Node Deviation

Analysing the end to end throughput and timing results for a single worker virtual machine, the results are as expected: spikes are seen for the baseline and upstream backup solutions in the time it takes to process a task from entering the stream processing system to exiting it. For the end to end throughput, data is still processed by the stream processing system for baseline and upstream backup when the late timing fault is injected, as the single virtual machine has multiple processing workers on it and data is processed by the other workers; even so, the throughput still drops by about 50% for baseline and upstream backup, whereas the fall for prediction is unnoticeable (see Figure 13).

Figure 12 Small Data Small Faults End to End Average Times

Figure 13 Small Data Small Faults End to End Data Throughput

Medium Data

For medium size data the results are very similar to those for the small data. These can be found in Appendix A.

Large Data

As a fixed timeout value was used for the experiments, the large data experiments show the same results for baseline and upstream backup as with smaller data, but there is more of a drop in throughput for prediction than with smaller amounts of data. This is expected: the timeout was set when working with smaller data, and with more data entering the system the worker processes each tuple more quickly, while the prediction method still times out at the same speed as with smaller data. A future improvement would be for the fault tolerance solution to learn the average time the single node takes to process a data tuple and set the timeout accordingly, increasing the flexibility of the solution. The parameter for detecting faults was not changed throughout the experiments, as it was an agreed fixed parameter at the start of the experiments.

Figure 14 Large Data Small Faults Single Node Throughput

Four Worker Virtual Machines

Increasing the number of virtual machines did not make much of an impact on the experimental results; as shown, the end to end throughput for large data still dropped. The reason is that the case study was not large enough to really put the system under pressure, and future work will apply this to a larger scale problem domain. This did, however, reveal a limitation of Storm: data was still sent to the slow processing node and not routed to nodes without faults, as Storm was unaware of the slow processing task, not seeing it as a fault. When viewing the end to end average time, prediction does eventually begin to slow; this is because the node was not tuned to this data size but to the small data size (see Figure 12, where the timing is kept minimal). The flexibility of the proposed solution means prediction can be tuned for each node to improve performance and overcome this issue. The rise in average time grows because data is buffered and waiting, and prediction has not been tuned to time out fast enough for this larger amount of data, which is usually processed more quickly.

Figure 15 Large Data Small Faults End to End Throughput

Figure 16 Large Data Small Faults End to End Average Time

Large Faults

Single Worker Virtual Machine

For the larger faults, four processing tasks were faulty and five were non-faulty, revealing how the system deals with 45% of the system in error. The results show that it does appear to take longer to set up the prediction system, as spikes can be seen between 1 and 6 seconds. The end to end results have been focused on, as they show how the system as a whole dealt with a larger number of faults. The data throughput shows that the prediction system deals better with the transient faults: its throughput does not really drop, whereas the baseline and upstream backup results drop by nearly 80% compared to normal running for the small data. The timings do appear to increase for the prediction technique over the course of the experiments, but this could be due to unseen data being used at different nodes, the synchronised faults exaggerating the time taken, or issues related to the long tail problem.

Small Data

Figure 17 Small Data Large Faults End to End Throughput

Figure 18 Small Data Large Faults End to End Average Times

Four Worker Virtual Machines, Large Data

Running with four worker virtual machines with large data and large faults shows a similar pattern to before: the end to end results show that prediction is not impacted as much as the baseline. Upstream backup also performs the strongest seen in all the experiments here, most likely because Storm knows where tuples are being acknowledged and so routes the data more efficiently. Upstream backup is still not as efficient as using prediction, especially if the prediction were tuned for larger data. The set up time for all three methods is also longer, as shown in the data throughput.

Figure 19 Large Data Large Faults End to End Throughput

Figure 20 Large Data Large Faults End to End Average Times

9.3 Resource Usage Results

Combined Small and Large Faults

The results for all types of data and both small and large faults have been averaged to give an overall view of the resource usage of the prediction technique; see Figure 21, Figure 22, Figure 23 and Figure 24. As expected, the CPU usage of the prediction technique is higher, by around 10% on the single worker machine but nearly 20% when using four worker virtual machines. This is because Storm split the program out more and moved other Storm worker processes off that virtual machine onto another. For the network usage, the results show, as expected, that upstream backup has a much higher network upload value, as it has to send acknowledgement messages to the previous node up the whole chain to the data source. The prediction technique has low values, the same as the baseline. This is important, as data centres require that network congestion is minimised.

Figure 21 Single Worker Virtual Machine Average Resource Usage

Figure 22 Single Worker Virtual Machine Average Network Usage

Figure 23 Four Worker Virtual Machines Average Resource Usage

Figure 24 Four Worker Virtual Machines Average Network Usage

9.4 Prediction Accuracy Results

The prediction accuracy was allowed a variance of 1.4%, as this metric is designed for integer values while the CPU usage and memory usage values are non-integer. Accuracy should be as high a percentage as possible, while mean absolute error should be minimised. The variance allowed in the accuracy may result in the same percentage value but a different mean absolute error value. The overall average accuracy was 63%, which is relatively high given that fairly simple algorithms were used and were not tuned to the data type. Across the results all the different algorithms were used, showing the importance of having a selection of different algorithms. Figure 25 shows example output for the prediction data, showing the predicted and actual values. The full results of the data prediction can be seen in the tables below.

Small Faults

Table 4 Single Worker Virtual Machine Small Faults Prediction Metrics

                      Small Data   Medium Data   Large Data
Accuracy              64%          61%           62%
Mean Absolute Error

Table 5 Four Worker Virtual Machines Small Faults Prediction Metrics

                      Small Data   Medium Data   Large Data
Accuracy              64%          61%           64%
Mean Absolute Error

Large Faults

Table 6 Single Worker Virtual Machine Large Faults Prediction Metrics

                      Small Data   Medium Data   Large Data
Accuracy              65%          60%           64%
Mean Absolute Error

Table 7 Four Worker Virtual Machines Large Faults Prediction Metrics

                      Small Data   Medium Data   Large Data
Accuracy              65%          61%           69%
Mean Absolute Error

Figure 25 Sample Worker Log Data for Prediction Results

10 Evaluation

10.1 Solution Evaluation

Overall, the newly proposed and developed prediction solution has been a success, although some relatively small additions would improve it greatly. The focus of the solution was on overcoming transient late timing faults caused by slow processing tasks. The solution performed strongly in addressing this problem, which current techniques such as upstream backup do not deal with efficiently. The timing requirements were met when using prediction, whereas no fault tolerance and upstream backup did not even detect the slow processing task: upstream backup behaved the same as the baseline but with increased network overheads. Upstream backup behaves this way because its timeout value is set at the default 30 seconds, so it never detected the fault. This value is global to all processing nodes in the Storm system, limiting it, whereas the prediction technique can be personalised for each processing node if required, improving its performance. Interestingly, Storm can process the data quickly once the slow processing task has unblocked, as the data is buffered, but it runs the risk of this buffer overflowing and losing data. An industrial Storm topology may have processing nodes that are more complex and take longer, leading to a larger effect on the system when a node is slow and its buffer is full, and possibly to the long tail problem.

The solution does have limitations: the way it overcomes transient late timing faults is by using prediction, which removes 100% accuracy from the system. This is domain specific, so it is only applicable to certain domains. Twitter uses approximate results [11] because it wants to meet its real time requirements, and any Storm topology or real time stream processing system that uses machine learning algorithms will not be 100% accurate anyway, for example Mobile Millennium [28], so for such systems this does not matter. The increased CPU usage could be an issue, but this is offset by actually dealing with the fault and by the reduced network overheads compared to upstream backup.

10.2 Future Work

Future work for this approach is to improve the detection of late timing faults, either by tuning the timeout so that it is learned from the average processing time of the individual node, or by predicting the fault itself, which is a whole field of research from which proven techniques could be attempted. The prediction system also requires the head prediction node to be built, which would enable monitoring of all processing nodes. Once all processing nodes in the

system are monitored, they can be mapped to see where blockages are being caused by other faults, such as latency in the network. If network latency is blocking a downstream node from working because of data dependencies, it can be informed that data should have arrived and can send predicted data in the same way. Currently the nodes are isolated, so there is no way of knowing whether data should have entered the processing node or not. Applying the prediction system to a larger scale Storm topology in an industrial domain would improve the experiments and show its performance on larger scale problems. Work would be needed to develop an appropriate algorithm for that domain, to increase prediction accuracy to an acceptable level. Once more complex prediction algorithms are formed, the prediction ranking system could be improved to reduce CPU consumption: a scheduling algorithm that only builds the model for the most successful algorithm over the last 10 runs, and randomly selects from the rest to see if it is overtaken by another algorithm, could greatly reduce CPU consumption.

10.3 Personal Reflection

Personally, the project has been a great success: a novel fault tolerant solution has been developed and experimented on with a Storm topology over a cluster of virtual machines, and the results appeared as expected. Not only has a solution been developed, a wealth of knowledge has been gained that has been very interesting and relevant for future work experiences. The most time consuming part of the project was reviewing all the papers in stream processing and finding a current problem that had not been solved particularly well, although all the reading and paper reviews completed led to a strong knowledge base in this area. The amount of knowledge consumed initially resulted in a problem with too large a scope that was unmanageable, and guidance led to this being reduced to a specific problem that needed to be solved, rather than just touching the surface of a large problem without making any research impact. Even though finding the problem was time consuming, this was expected, as it was a new challenge because of the novelty of the solution and the problem. Not only has academic research knowledge improved, technical skills have improved greatly: working with distributed systems over multiple virtual machines was a new technical challenge, and Java skills have been extended to distributed systems, along with building a strong knowledge of multi-threaded programming in Java. Using an incremental and agile approach to the development was important, as defining the problem took a large part of the project and the resulting code did change throughout. A major potential blocker identified was setting up the Storm cluster, as this was an

unknown area. It was therefore completed early in the project, before developing the actual fault tolerance solution, so the timeline could be managed more easily. Keeping a log book throughout the project kept the management of documents in one place. Starting the report early was also a success: a bullet point report was written at the earliest possible stage to keep the project on track.

11 Conclusion

A new technique for overcoming transient late timing faults caused by slow processing tasks has been developed and its feasibility evaluated. This new technique uses data prediction to overcome late timing faults at each individual processing node in the stream processing system. The feasibility analysis has shown that data prediction can mask the late timing fault impacts of slow processing tasks in stream processing systems. Key results are that data throughput is maintained at a similar rate when using the data prediction solution, even though a slow processing task is causing a late timing fault. The latency of the slow processing task is masked, as the time to process the tuples increases only minimally, with the main increase occurring when the fault is first detected. The data prediction technique can be scaled up to multiple virtual machines and is designed so that it can be tuned for each processing node. It strongly outperforms the current fault tolerance technique in Storm, a form of upstream backup. Upstream backup has a global timeout value that defaults to 30 seconds, and it does not detect the injected transient late timing fault, as the fault only lasts for 8 seconds. The prediction technique can easily be enhanced so that it can also tolerate late timing faults throughout the system caused by data omission, where the data does not enter the processing node. The node can be notified that data should have entered but that network latency or an upstream failure has blocked its required data; if this happens, a head node can inform the impacted node and the prediction data is used, allowing data to flow through the system and tolerating the data omission fault.

12 References

[1] Min Chen, Shiwen Mao, Yin Zhang, and Victor C. M. Leung, Big Data: Related Technologies, Challenges and Future Prospects. Springer.
[2] Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, and Keqiu Li, "Big data processing in cloud computing environments," in Pervasive Systems, Algorithms and Networks (ISPAN), International Symposium on, IEEE.
[3] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica, "Discretized streams: Fault-tolerant streaming computation at scale," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ACM.
[4] Hermann Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications. Springer Science & Business Media.
[5] D. Robins, "Complex event processing," in Second International Workshop on Education Technology and Computer Science, Wuhan, February.
[6] Sirish Chandrasekaran et al., "TelegraphCQ: continuous dataflow processing," in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003.
[7] M. Balazinska, H. Balakrishnan, S. R. Madden, and M. Stonebraker, "Fault-tolerance in the Borealis distributed stream processing system," ACM Transactions on Database Systems (TODS), 33(1), 3.
[8] B. Ellis, Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data. John Wiley & Sons.
[9] G. Aceto, A. Botta, W. De Donato, and A. Pescapè, "Cloud monitoring: A survey," Computer Networks, 57(9).
[10] Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik, "The 8 requirements of real-time stream processing," ACM SIGMOD Record, 34(4).
[11] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson et al., "Storm @twitter," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.
[12] Kyungmin Lee, David Chu, Eduardo Cuervo, Y. Degtyarev, S. Grizan, J. Kopf, Alec Wolman, and Jason Flinn, "Outatime: Using speculation to enable low-latency continuous

interaction for mobile cloud gaming," in Proc. of MobiSys, May.
[13] Google. (2015, Aug.) Google Cluster Data V2. [Online].
[14] Git. (2015, Sep.) Git. [Online].
[15] Atlassian. (2015, Sep.) Bitbucket. [Online].
[16] Tony Hey, Stewart Tansley, and Kristin Tolle, "The Fourth Paradigm: Data-Intensive Scientific Discovery," Second Printing Version, October.
[17] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, 2(1), pp. 1-20.
[18] P. Mell and T. Grance, "The NIST definition of cloud computing."
[19] C. Ballard, D. M. Farrell, M. Lee, P. D. Stone, S. Thibault, and S. Tucker, IBM InfoSphere Streams: Harnessing Data in Motion. IBM Redbooks.
[20] Andrey Brito, Christof Fetzer, and Pascal Felber, "Minimizing latency in fault-tolerant distributed stream processing systems," in Distributed Computing Systems, ICDCS'09, 29th IEEE International Conference on, IEEE.
[21] Sean T. Allen, Matthew Jankowski, and Peter Pathirana, Storm Applied: Strategies for Real-Time Event Processing. NY: Manning Publications Co.
[22] Tom White, Hadoop: The Definitive Guide. O'Reilly Media, Inc.
[23] Apache. Apache Spark. [Online].
[24] A. Avižienis, J. C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," Dependable and Secure Computing, IEEE Transactions on, 1(1), August.
[25] Pankaj Jalote, Fault Tolerance in Distributed Systems. Prentice-Hall, Inc.
[26] J. H. Hwang, M. Balazinska, A. Rasin, U. Çetintemel, M. Stonebraker, and S. Zdonik, "High-availability algorithms for distributed stream processing," in Data Engineering, ICDE 2005, Proceedings of the 21st International Conference on, April.
[27] L. Da Xu, W. He, and S. Li, "Internet of Things in industries: A survey," IEEE

Transactions.
[28] Timothy Hunter, Teodor Moldovan, Matei Zaharia, Samy Merzgui, Justin Ma, Michael J. Franklin, Pieter Abbeel, and Alexandre M. Bayen, "Scaling the Mobile Millennium system in the cloud," in Proceedings of the 2nd ACM Symposium on Cloud Computing, p. 28, October.
[29] Matthew A. Russell, Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More. O'Reilly Media, Inc.
[30] T. Chardonnens, P. Cudre-Mauroux, M. Grund, and B. Perroud, "Big data analytics on high velocity streams: A case study," in Big Data, 2013 IEEE International Conference on.
[31] M. A. Shah, J. M. Hellerstein, and E. Brewer, "Highly available, fault-tolerant, parallel dataflows," in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 2004.
[32] G. Vigna, W. Robertson, V. Kher, and R. Kemmerer, "A stateful intrusion detection system for world-wide web servers," in Computer Security Applications Conference, 19th Annual, December.
[33] Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Ugur Cetintemel, Michael Stonebraker, and Stan Zdonik, "A comparison of stream-oriented high-availability algorithms," Technical Report TR-03-17, Computer Science Department, Brown University.
[34] Stephen Kaisler, Frank Armour, Juan Antonio Espinosa, and William Money, "Big data: Issues and challenges moving forward," in System Sciences (HICSS), Hawaii International Conference on.
[35] Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch, "Integrating scale out and fault tolerance in stream processing using operator state management," in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013.
[36] B. Koldehofe, R. Mayer, U. Ramachandran, K. Rothermel, and M. Völz, "Rollback-recovery without checkpoints in distributed event processing systems," in Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems, June 2013.
[37] Apache. (2015) Apache Storm. [Online].

[38] Apache. (2015) Apache SAMZA. [Online].
[39] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, 2010.
[40] Amazon. (2015) Amazon Web Services. [Online].
[41] Apache. (2015, Mar.) Apache Storm. [Online].
[42] N. Andrienko and G. Andrienko, Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach. Springer Science & Business Media.
[43] Kai Wang and Manoj Franklin, "Highly accurate data value prediction using hybrid predictors," in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, December.
[44] M. F. Younis, T. J. Marlowe, A. D. Stoyen, and G. Tsai, "Statically safe speculative execution for real-time systems," Software Engineering, IEEE Transactions on, 25(5).
[45] Brad Calder, Glenn Reinman, and Dean M. Tullsen, "Selective value prediction," in Proceedings of the 26th International Symposium on Computer Architecture.
[46] Huiyang Zhou, C. ying Fu, Eric Rotenberg, and T. Conte, "A study of value speculative execution and misspeculation recovery in superscalar microprocessors," Department of Electrical & Computer Engineering, North Carolina State University, Tech. Rep.
[47] J. H. Hwang, Y. Xing, U. Çetintemel, and S. Zdonik, "A cooperative, self-configuring high-availability solution for stream processing," in Data Engineering, ICDE 2007, IEEE 23rd International Conference on, April.
[48] Peter Garraghan, Xue Ouyang, Paul Townend, and Jie Xu, "Timely long tail identification through agent based monitoring and analytics," in Real-Time Distributed Computing (ISORC), IEEE 18th International Symposium on.
[49] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, 51(1), 2008.

[50] V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert, "Proactive management of software aging," IBM Journal of Research and Development, 45(2).
[51] M. Grottke, L. Li, K. Vaidyanathan, and K. S. Trivedi, "Analysis of software aging in a web server," Reliability, IEEE Transactions on, 55(3).
[52] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, "Software rejuvenation: Analysis, module and applications," in Fault-Tolerant Computing, FTCS-25, Digest of Papers, Twenty-Fifth International Symposium on, IEEE, June.
[53] The University of Edinburgh. (2015, Jan.) Geosciences. [Online].
[54] Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
[55] Nathan Marz and James Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications Co.
[56] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Pearson Education.
[57] Gregor Hohpe and Bobby Woolf, Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Professional.
[58] Matthew L. Massie, Brent N. Chun, and David E. Culler, "The Ganglia distributed monitoring system: Design, implementation, and experience," Parallel Computing, 30(7).
[59] John Casey et al., Maven: The Definitive Guide. O'Reilly.
[60] Spring. (2015, Aug.) Spring. [Online].
[61] Craig Walls and Ryan Breidenbach, Spring in Action. Greenwich: Manning Publications Co.
[62] R. C. Martin, Agile Software Development: Principles, Patterns, and Practices. Prentice Hall PTR.
[63] Mockito Community. (2015, Aug.) Mockito. [Online].

73 [64] Cobertura. (2015, Apr.) Cobertura. [Online]. HYPERLINK " [65] Lasse Koskela, Effective Unit Testing: A guide for Java developers. New York: Manning, [66] University of Maryland. (2015, Aug.) FindBugs - Find Bugs in Java Programs. [Online]. HYPERLINK " [67] Huiyang Zhou, C. ying Fu Eric Rotenberg, and T. Conte, "A study of value speculative execution and misspeculation recovery in superscalar microprocessors," Department of Electrical & Computer Engineering, North Carolina State University, 2000.

Appendix A External Materials

13.1 Project Timeline Version 3
