Imperial College of Science, Technology and Medicine. Department of Computing. Distributed Detection of Event Patterns

Imperial College of Science, Technology and Medicine
Department of Computing

Distributed Detection of Event Patterns

by Nicholas Poul Schultz-Møller

Submitted in partial fulfilment of the requirements for the MSc Degree in Advanced Computing of Imperial College London

September 2008

Abstract

The nature of data on today's Internet and in enterprises is changing. Data used to be mostly static and stored in large databases, but today new data or changes in existing data are becoming increasingly valuable. Examples are news being published, updates on social websites, blog postings and RSS feeds. Other sources of constantly changing information are currency exchange rates, prices of flight tickets and sensors in, for example, modern windmills that send operational data to a central server. These changes in the data can be thought of as events. Many industries, especially the financial sector, are very interested in detecting patterns of events for decision making. An example application of event pattern detection is soft real-time detection of credit card fraud, where each transaction is an event.

This interest has led to the development of Complex Event Processing (CEP) systems, which can process and find user-specified event patterns, expressed as queries. A particular class of CEP systems with automata-based detection mechanisms has shown high performance, but most of these systems are single-server systems, which inevitably limits their scalability. Furthermore, query optimization in this type of system has not been investigated.

This report presents a new distributed CEP system, the NEXT CEP System, that has a high-level query language to express event patterns. The detection mechanism is automata-based and it allows filters and predicates with respect to the events in the pattern to be expressed. The temporal model and event model underlying the developed system enable query rewriting, which has not been possible in existing systems. Algorithms for query rewriting, multi-query optimization and optimization of operator deployment are developed, and the efficiency of these is measured in several experiments. The experiments are conducted on the network testbed Emulab, and they show reduced average CPU usage in the range from 7% to 63% compared to non-optimized queries.

Acknowledgement

First of all I would like to thank my supervisor, Dr. Peter R. Pietzuch. This project would not have been possible if it was not for him. Thank you for motivating me, giving me suggestions and for having many good discussions. Thanks also to Rowan McRae, a very talented law student and native English speaker, for proof-reading part of my work and giving me excellent advice regarding my language and formulations. I would also like to thank Sara Maria Dalbjerg for my absence in the months while I did the project. It would have been difficult without her tolerance and loving support. Furthermore I would like to thank Goodenough College for providing excellent facilities to study and live in, and my friends (and their theses) for keeping me company in the Pearson Library.

Contents

1 Introduction
    Application Scenario
    Contributions and Project Goals
    Report Structure

2 Detection of Event Patterns
    General Characteristics of Event Detection Systems
    Publish/Subscribe Systems
    The Anatomy of an Event
    Complex Events
    Complex Event Processing Systems
    Optimization Goals and Performance Metrics
    Active Databases
    Snoop
    SAMOS
    Other Systems
    Conclusion
    Data Stream Management Systems
    STREAM: The Stanford Data Stream Management System
    Borealis
    Other Systems
    Conclusion
    Complex Event Processing Systems
    Cayuga
    DistCED: A Framework for Event Composition in Distributed Systems
    Other Systems
    Conclusion
    Commercial Event Detection Systems
    Conclusion

3 Distributed Query Optimization
    The Problem
    Insights from Distributed DBMS Query Optimization Algorithms
    Algorithms for Query Optimization in Distributed Stream Systems
    Greedy Algorithms
    Using Hierarchies and Clustering
    A Cost-Space Approach
    Dynamic Load Distribution
    Query Optimization in CEP Systems
    Conclusion

4 The Query Language for Event Patterns
    Definition of Events and Event Streams
    Temporal Model
    Automata-Based Event Detection
    The Core Language for Event Patterns
    A High-Level Event Query Language
    Conclusion

5 The NEXT CEP System
    Design
    The Central Manager
    The Operators
    Implementation
    Language Implementation
    Operator Implementation
    Summary of the Software Engineering Process
    Conclusion

6 Optimization of Event Pattern Detection
    Optimization Goal
    Assumptions and Collected Statistics
    Cost Model
    Query Rewriting
    Optimization of Event Patterns with the Union Operator
    Optimization of Event Patterns with the Next Operator
    Optimization of Event Patterns with the Exception Operator
    Operator Distribution Algorithm
    Conclusion

7 Evaluation
    Method
    Postprocessing Results
    Overview of the Conducted Experiments
    Operator Performance
    Performance of the Filter Operator
    Performance of the Union Operator
    Performance of the Next Operator
    Performance of the Iteration Operator
    Performance of the Exception Operator
    Efficiency of the Optimization Algorithms
    Optimizing Event Patterns with the Union Operator
    Optimizing Event Patterns with the Next Operator
    Optimizing Event Patterns with the Next and the Union Operator
    Optimizing the Deployment Plan
    Other Experiments
    Comparison with Existing Systems
    Conclusion

8 Conclusion
    Future Work
    Improvement of the System and the Query Language
    Improving Cost Functions
    Dynamic Query Optimization

Bibliography

A Grammar File for the High-Level Query Language
B Test Examples
C Building the Optimal Pattern of Next Operators
D Error Graphs for Performance Tests of Operators

Chapter 1 Introduction

The information available on today's Internet is tremendous, and advanced search engines, such as Google, allow us to search the content to retrieve useful information. But the nature of the Internet has changed over the last couple of years from a medium containing mostly static content to one with mostly dynamic content. Today, RSS feeds, blog postings, social networks, news sites and so on are sources of constantly new and changing information. The data describing these changes can be viewed as events, and users are likely to be interested in being notified when such events occur. An example could be an author of a Wikipedia article who is interested in being notified when someone edits his or her article. Many other types of information changes can also be viewed as events. One example is financial data such as changing exchange rates and stock prices, which form the basis of an entire field called algorithmic trading [69].

Being notified about individually occurring events can be of great value, but of even greater value is the detection of patterns of events. If a user is only interested in, for example, important news, an interesting pattern of events would be the publication of news articles from cnn.com, bbc.com and times.com that have the same keywords and occur within 30 minutes of each other. This is illustrated in Figure 1.1, which shows three streams of events (published news articles) which match the pattern.

Figure 1.1: Three events (news articles) on three different streams that match a pattern.

Other applications of event pattern detection are found in industries as varied as the health sector, supply chain management, Business Activity Monitoring (BAM) and the entertainment industry. As an example of an event pattern in the entertainment industry, a casino would be interested in events where the same player wins significantly more times than probability theory predicts.
Events and the detection of event patterns are also used in distributed computer systems. Here, events can be used as a loosely coupled, asynchronous messaging mechanism where components, for example sensors, subscribe to events of interest from other components. This is called an Event Driven Architecture [51]. As events occur they are published (sent) to subscribers.

To support detection of event patterns, Complex Event Processing (CEP) systems have been developed. These systems, often using automata-based detection mechanisms, allow a client application to submit patterns, expressed as queries, to the CEP system, which then notifies the client of occurring complex events matching the query in soft real-time. However, most existing CEP systems run on a single machine and are thus limited in the number of sources and events they can process because of the CPU resources needed for complex event processing. Many application scenarios, such as the one introduced below, have high-rate streams, and the event patterns of interest require a lot of processing. This motivates the building of a distributed CEP system. Having a distributed CEP system does, however, introduce new challenges: in order to utilize the resources in an efficient way, queries have to be optimized. Furthermore, little work has been done on optimizing the queries running in existing CEP systems.

In this project, I will develop a distributed Complex Event Processing system and investigate algorithms for optimizing queries. The following application scenario will be used as an example throughout the report, and will furthermore motivate the project.

1.1 Application Scenario

The application scenario introduced in this section will illustrate challenges and show how the additional resources in a distributed CEP system can be utilized. Generally there are many interesting applications of CEP systems, especially within algorithmic trading, but a simpler scenario has been chosen: detection of credit card fraud. This is also motivated by credit card fraud committed against the author's Visa card in spring 2008!

Detection of Credit Card Fraud

Credit card fraud has been committed for many years and remains a major problem [72]. Just considering the US, 9.91 million people were targeted by credit card fraud in 2007 with a total loss of US$ 52.6 billion [53].
The fraud typically happens when the credit card owner uses the card in a conventional store. During the payment an employee can easily run the card through a card reader without the owner's knowledge. Later the copied credit card information is sold to other criminals who manufacture a new card and commit the actual fraud. There are several patterns for this and many of these can be detected by a CEP system.

Using an event detection system is useful because fast detection of suspicious patterns would, for example, allow the bank to contact the customer early in the fraud or even automatically alert the user about suspicious credit card transactions by sending text messages. In the text message, the bank could encourage the customer to contact the bank if the transactions were not initiated by the customer (see Figure 1.2).

Figure 1.2: An example of a text that could be sent to the customer by the (fictional) bank, HB Bank, when a suspicious event pattern is detected.

One typical pattern of credit card fraud is that the criminals start by testing the card with a few small transactions. Then they make a few big transactions and afterwards dispose of the card or resell it to other criminals. This could be expressed in a high-level query language as:

    SELECT *
    FROM ( t S1 ; t S2 ; t L )
    WHERE FILTER(S1.acc = S2.acc), FILTER(S2.acc = L.acc),
          S1.amount < 100 AND S2.amount < 100 AND L.amount > 250
          AND (S1, L) OCCURS WITHIN 12 hours

This is an example of the high-level query language developed in this project. The semantics will be introduced in Chapter 4 and therefore only an informal description will be given here. The syntax closely resembles SQL and the semantics are also similar. The FROM clause contains three references to the source t of all Visa credit card transactions in a geographical region (e.g. US) with three different aliases S1, S2 and L. The WHERE clause consists of a list of predicates separated by commas. The first two are filter predicates and they specify that the account number in each of the three consecutive transactions (events) must be the same. The last predicate specifies that the first two transactions must have an amount of less than 100, the last an amount of greater than 250, and that the first and the last transaction must occur within 12 hours.

Several other patterns could also be used to detect fraud. Here, four further patterns are presented.

1. Detect a large transaction occurring in a different region (t2) than other transactions within 12 hours:

    SELECT *
    FROM ( t1 A1; t2 B; t1 A2 )
    WHERE FILTER(A1.acc = B.acc), FILTER(A2.acc = B.acc),
          B.amount > 1000 AND (A1, A2) OCCURS WITHIN 12 hours

(As an interesting side point, this single query could have reduced the fraud committed against me by approx )

2. Detect a large transaction in a high risk country (e.g. Russia or Ukraine; note that the choice of these two countries is rather arbitrary and merely serves as an example):

    SELECT *
    FROM ( t )
    WHERE t.amount > 1000 AND (t.country = "RU" OR t.country = "UA")

3. Detect three consecutive, large transactions occurring within a short period of time:

    SELECT *
    FROM ( t T1; t T2; t T3 )
    WHERE FILTER(T1.acc = T2.acc), FILTER(T2.acc = T3.acc),
          T1.amount > 1000 AND T2.amount > 1000 AND T3.amount > 1000
          AND (T1, T2) OCCURS WITHIN 1 hour

4. Detect a series of transactions with a total amount larger than 5000 occurring within 12 hours:

    SELECT *
    FROM ( t T1; (t T2+; t T3) )
    WHERE FILTER(T1.acc = T2.acc), FILTER(PREV(T2.acc) = T2.acc), FILTER(T1.acc = T3.acc),
          SUM(T2.amount) > 5000 AND (T1, T3) OCCURS WITHIN 12 hours
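To make the informal reading of the first fraud pattern (two small transactions followed by a large one on the same account within 12 hours) more concrete, the following Python sketch checks it over an in-memory list of transactions. It is only an illustration under stated assumptions, not the automata-based mechanism of the NEXT CEP System: the `Transaction` class, the function name and the interpretation of "consecutive" as "consecutive among the transactions of one account" are all hypothetical choices made for this example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

TWELVE_HOURS = 12 * 60 * 60  # window length in seconds


@dataclass(frozen=True)
class Transaction:
    """Hypothetical event type: account number, amount, timestamp in seconds."""
    acc: int
    amount: float
    ts: float


def detect_small_small_large(transactions):
    """Return (S1, S2, L) triples matching the first fraud pattern.

    Assumptions (not taken from the thesis): the input is ordered by
    timestamp, and "consecutive" means consecutive per account.
    """
    # Keep only the last three transactions seen for every account.
    recent = defaultdict(lambda: deque(maxlen=3))
    matches = []
    for t in transactions:
        window = recent[t.acc]
        window.append(t)
        if len(window) < 3:
            continue
        s1, s2, large = window
        if (s1.amount < 100 and s2.amount < 100 and large.amount > 250
                and large.ts - s1.ts <= TWELVE_HOURS):
            matches.append((s1, s2, large))
    return matches
```

Called on a stream in which account 1 makes transactions of 5, 20 and 900 within one hour, the function reports one match; the same amounts spread over more than 12 hours are not reported.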

Properties of Data Sources

The transactions that a CEP system processes are sent from a number of sources. In this section the properties of these event sources in the example application scenario will be presented along with the assumptions made. The sources are assumed to be credit card processing companies, which process e.g. all Visa card transactions for a region such as the US. The transactions made with Visa cards in the US will be used as an example source.

According to Visa [46], they processed billion transactions in 2007, excluding cash transactions. This gives an average of 875 credit transactions per second assuming that the usage is uniformly distributed. Assuming that 80% of all transactions occur in 8 hours of the day (morning, lunch and early evening), this gives an event rate of 2100 transactions per second. To model this, I assume that the sources with credit card transactions follow a Poisson distribution with an average rate of 2100 events per second. Note that because of the high average rate (µ = 2100), the rate will not vary very much (SD = √µ ≈ 46), i.e., the probability of the rate being in the interval events per second is 99.9%.

The information in each transaction event is assumed to consist of:

    Description                              Size                  Example
    Transaction id number                    8 bytes
    The account number                       4 bytes
    The amount transferred                   4 bytes               4
    The currency                             3 bytes               USD
    The country code                         2 chars/bytes         US
    A short description of the transaction   max 255 chars/bytes   Tall drip, Starbucks
    Timestamps                               2 × 8 bytes           ( , )

This gives a minimum and maximum size of 29 and 292 bytes respectively of the information in an event. Thus I assume that the average content size is 150 bytes. The content size is approximately modelled with a normal distribution with an average of 150 bytes and a spread of 30 bytes. This gives an average data rate of approximately 300 kB per second per source. This amount of data can easily be handled by a standard 100 Mbit/s Ethernet network.
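The arithmetic behind these figures can be retraced with a few lines of Python. The sketch below starts from the stated average of 875 transactions per second (the yearly total is not repeated here) and uses only the assumptions given in the text; the function and variable names are illustrative.

```python
import math


def peak_event_rate(avg_rate_per_s, peak_share, peak_hours):
    """Event rate during peak hours if `peak_share` of a day's events
    fall within `peak_hours` hours of the day."""
    events_per_day = avg_rate_per_s * 24 * 3600
    return events_per_day * peak_share / (peak_hours * 3600)


avg_rate = 875.0                                 # transactions per second, from the text
peak_rate = peak_event_rate(avg_rate, 0.80, 8)   # about 2100 events per second

# For a Poisson distribution the standard deviation is the square root
# of the mean, which gives roughly 46 events per second here.
rate_sd = math.sqrt(peak_rate)

avg_event_bytes = 150                            # assumed average content size
bytes_per_second = peak_rate * avg_event_bytes   # about 315,000 bytes/s
mbit_per_second = bytes_per_second * 8 / 1e6     # about 2.5 Mbit/s per source
```

At roughly 2.5 Mbit/s per source, a 100 Mbit/s link is indeed far from saturated, which supports the conclusion that the network is not the bottleneck.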
The Need for a Distributed Detection System

The data rate of the sources indicates that the amount of data sent across the network is not an issue. However, processing of events is. If a query processes events from several different sources, then hundreds of thousands of events have to be processed per second, and clearly processing even more events on a single machine does not scale. The bottleneck is the CPU resources available on a single machine (even with current multicore machines). If these are depleted, the system has to either discard events or buffer events. However, buffering events is only a solution if the average CPU usage is below 100%, because otherwise the buffer will grow unbounded.

Another issue is that queries are long running and results could be lost in the case of hardware failure or software errors. Having a distributed detection system allows operators to be replicated to cope with hardware failures, and in the case of software errors only some operators crash (of course depending on how pervasive the error is). So distributing the event detection onto several machines has clear advantages, but the problem is how to distribute the processing to utilize the additional machines in an efficient way. This is one of the areas investigated in this project.
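The remark about buffering can be illustrated with a deliberately simple, deterministic model: in every second a fixed number of events arrive and a fixed number can be processed. This is a sketch with hypothetical names and constant rates, not a model of the actual system, where arrivals are stochastic.

```python
def backlog_over_time(arrival_rate, service_rate, seconds):
    """Return the buffer size after each second for constant
    arrival and service rates (events per second)."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # Events that could not be processed stay in the buffer.
        backlog = max(0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history


# Demand above capacity: the buffer grows by 100 events every second.
overloaded = backlog_over_time(arrival_rate=2100, service_rate=2000, seconds=10)

# Demand below capacity: the buffer stays empty.
sustainable = backlog_over_time(arrival_rate=2100, service_rate=2200, seconds=10)
```

With the overloaded setting the backlog grows linearly without bound, so a buffer only helps to absorb temporary bursts while the average load stays below the processing capacity.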

1.2 Contributions and Project Goals

The three main contributions of this project are a high-level event query language, a distributed CEP system and algorithms for optimizing queries and their deployment. Furthermore I develop a new automata model for describing the semantics of operators used in event patterns. I also present six core operators, supported by the high-level language, that are implemented in the CEP system. The temporal model and event model underlying the automata model allow query optimizations that have not been possible in current CEP systems. Three types of optimizations are investigated (query rewriting, multi-query optimization and optimization of operator deployment) and all are implemented in the developed CEP system. To support query optimization, I have derived cost functions for the event operators based on their automata models. These provide insights into the low-level processing issues that need to be solved to have efficient event pattern detection. Finally I evaluate the performance of the CEP system, its operators and the optimization algorithms on the Emulab network emulation testbed and compare the performance with that of existing CEP systems.

To summarize, the goals of this project are:

1. To create a high-level query language for event patterns suitable for query optimization.
2. To design and implement a distributed system for detection of event patterns described in this language.
3. To derive cost functions for queries allowing the detection to be optimized.
4. To design and implement heuristics and algorithms to optimize the queries and the deployment of these with respect to the cost functions.
5. To evaluate the performance of the algorithms and the system on the network emulation testbed Emulab.

I have implemented a distributed CEP system with a high-level language. The system supports multiple clients, queries and event sources as well as sinks. Furthermore, I have found algorithms that rewrite queries to their optimal form with respect to the derived cost functions. Tests of this optimization show an improved average CPU usage of up to 33%. I have also implemented a greedy algorithm for deploying operators. This algorithm shows an improvement in average CPU usage of 2.8%. Finally, I have implemented algorithms for performing multi-query optimization by reusing existing deployed operators. This optimization shows an improved average CPU usage of 63% when used in combination with the other optimizations.

1.3 Report Structure

In Chapter 2, I introduce event-based systems and present existing work on detecting events within three different types of systems: Active Databases, Data Stream Management Systems (DSMS) and Complex Event Processing systems. I furthermore discuss the advantages and disadvantages of the different approaches taken by these systems and the techniques they use. Chapter 3 describes the query optimization problem and existing algorithms used in CEP systems, databases and Data Stream Management Systems. Chapter 4 introduces the event model, temporal model and automaton model underlying the developed CEP system. Furthermore, I present a core language consisting of six core operators that are described by the automaton model. I also introduce the high-level query language, which has its semantics described by the core operators.

In Chapter 5, I present the design and implementation of the developed distributed Complex Event Processing system. Chapter 6 derives cost functions of the core operators and presents the optimization algorithms for rewriting the queries and finding an optimal query distribution plan. Chapter 7 evaluates the performance of the developed detection system and the performance improvements achieved by the optimization algorithms. Finally, Chapter 8 concludes and reflects on the findings in this project and suggests future directions within the field.

Chapter 2 Detection of Event Patterns

Events have been detected in many different types of systems, using many different approaches and detection mechanisms. This chapter will introduce detection of event patterns and present existing systems and their associated event models, temporal models and query languages.

2.1 General Characteristics of Event Detection Systems

This section will introduce the different paradigms and terminologies used within Event Detection Systems and address the most common domain problems. The general architecture underneath most Event Detection Systems is the publish/subscribe architecture, which makes it a natural starting point.

Publish/Subscribe Systems

The origin of a simple event or primitive event is an event source which is located in a network, typically the Internet. The situation is shown in Figure 2.1. As events occur, the event source can itself publish events or send them to a broker. Any client (a system or a user) interested in events can subscribe to events at the event source (or the broker). The events are then streamed to the subscriber, which is also called an event sink. This is the architectural publish/subscribe paradigm [36] and it has several advantages. First of all, it allows for very loosely coupled systems because publishers are not aware of subscribers and vice versa. Subscribing and unsubscribing is usually handled by middleware software. It also allows asynchronous communication between publishers and subscribers. The disadvantages of the publish/subscribe paradigm are closely related to its advantages: a publisher cannot be guaranteed that a particular subscriber receives its events because it may not be registered and, in general, the systems cannot give any guarantees about event delivery.

Another characteristic of traditional publish/subscribe (pub/sub) systems is that filters can be specified by the subscriber. As a result, only events of interest will be sent to the subscriber. The filtering can be topic-based, content-based or a hybrid of the two. Topic-based filtering allows a subscriber to specify the topics it wants to receive events on. On the other hand, content-based filtering allows the subscriber to specify constraints on the attributes of events that it wants to receive. Attributes are data describing the event, and predicates describe properties that these must fulfill. This allows a subscriber to reduce the number of events it receives but still leaves the majority of the event handling (e.g. the aggregation) to the subscriber. But before elaborating on this issue, I will introduce several characteristics of events.
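The difference between topic-based, content-based and hybrid filtering can be sketched with a minimal in-process broker. This is a toy illustration only: the `Broker` class and its method names are hypothetical and do not correspond to any particular pub/sub middleware, and events are assumed to be plain dictionaries of attributes.

```python
class Broker:
    """Toy pub/sub broker supporting topic and content filters."""

    def __init__(self):
        # Each subscription is (topic or None, predicate or None, callback).
        self._subscriptions = []

    def subscribe(self, callback, topic=None, predicate=None):
        """Register a sink; `topic` and `predicate` are optional filters."""
        self._subscriptions.append((topic, predicate, callback))

    def publish(self, topic, event):
        """Deliver `event` to every sink whose filters accept it."""
        for sub_topic, predicate, callback in self._subscriptions:
            if sub_topic is not None and sub_topic != topic:
                continue  # topic-based filter rejects the event
            if predicate is not None and not predicate(event):
                continue  # content-based filter rejects the event
            callback(event)


broker = Broker()
by_topic, by_content, hybrid = [], [], []

broker.subscribe(by_topic.append, topic="transactions")
broker.subscribe(by_content.append, predicate=lambda e: e.get("amount", 0) > 1000)
broker.subscribe(hybrid.append, topic="transactions",
                 predicate=lambda e: e.get("country") == "US")

broker.publish("transactions", {"acc": 1, "amount": 50, "country": "US"})
broker.publish("transactions", {"acc": 2, "amount": 5000, "country": "UA"})
broker.publish("news", {"headline": "example", "amount": 2000})
```

Note that each sink still receives individual events; correlating several events into a pattern remains the subscriber's job, which is the limitation the text refers to.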

Figure 2.1: A network with a pub/sub system. The event source Src 1 directly publishes its events while Src 2 sends its events to a broker. Sink 1 subscribes to both Src 1 and the broker. Sink 2 only subscribes to events from the broker.

The Anatomy of an Event

Events are modelled differently in various systems, primarily because of the assumptions made in the temporal model (temporal models will be introduced along with specific systems in Sections 2.2, 2.3 and 2.4). In general, events have one or more associated time stamps depending on whether they are modelled as being instantaneous or as having a duration. Some systems also distinguish between occurrence time and detection time for various purposes (e.g. SAMOS, which is presented in Section 2.2.2) and thus events have additional time stamps. The purpose of the temporal model is primarily to establish an ordering of events, as this is important to many event operators used for specifying patterns.

Generally events also carry data (a payload). There are several approaches for describing the semantics of the data, the most typical being a schema-based approach (as in databases): each event is treated as a tuple with various fields including time stamps. Another approach is using XML schemas and tags. The advantage of schema-based events is that the size of an event is generally smaller than an equivalent event described in XML, although the schema has to be transmitted separately. On the other hand, XML has the advantage of being relatively self-describing. XML also allows a more complex document structure. XML does, however, also require the schema to be transmitted if syntax checks are needed.

Complex Events

As described in the introduction, individual events themselves do not provide much value; detecting a pattern of events can be much more valuable. A sequence of events matching a pattern is typically called a complex event (CE). As described above, traditional pub/sub systems cannot detect complex events. The many events that a CE consists of present some challenges with respect to the temporal properties: when does an event pattern occur, how should it be time stamped, and does a CE have a duration or is it instantaneous? Various systems deal with this in different ways and there are different advantages and disadvantages, which will be explained in Sections 2.2, 2.3 and 2.4.

Once a complex event is detected, some systems output it to a client, whereas others publish it as a new event. The latter approach is used in many systems (especially in Data Stream Management Systems (DSMS)) and is elegant because it allows homogeneous composition of complex events into other more complex events (in some papers, complex events are in fact denoted as composite events, e.g. [11, 38, 49, 50, 55]). The detection of complex events is done in complex event processing systems.

Complex Event Processing Systems

In many systems, event processing is centralized. The CEP system itself is at a data centre and the detectors are distributed across several nodes, as depicted in Figure 2.2. Other distributed approaches exploit the intermediate nodes in the network to correlate and aggregate events into CEs (so-called in-network processing), which are then republished, as also shown in Figure 2.2.

Figure 2.2: A network of nodes that forwards events. In-network processing: the intermediate node N aggregates events from event stream 1 and 2 into a CE stream (stream 3). Data centre approach: Sink 2 consists of a cluster of servers for detecting the CE patterns.

These two approaches have different advantages and disadvantages. A CEP system with its detectors scattered across the network can lower bandwidth usage compared to a data centre because it can filter events closer to the source. But the network, which would be a Wide Area Network (WAN) or the Internet, generally has lower performance than a network in a data centre. Furthermore, the nodes are placed in a hostile environment (e.g. DoS attacks and network contention). The lower bandwidth and greater distance between nodes also make query optimization more challenging: gathering statistics to inform the query optimization and operator reallocation requires stable network links with enough network bandwidth. In a data centre, the network is used solely by event detection nodes and the network can have high-performance links. This allows easier load-balancing and operator redistribution, which entails higher performance. Another advantage of a data centre deployment is that the CEP system and query optimization algorithms are simpler to design.

Comparison with Database Management Systems

CEP systems, especially DSMS, share many characteristics with Database Management Systems (DBMS) but are also in many ways duals of DBMS [71]. This makes it interesting to compare the two, as substantial research has been done in DBMS and many mature techniques have been developed.

20 10 Detection of Event Patterns In both types of systems, one or more clients issue queries that describe the information that the client(s) need. In general, queries include filtering, correlation, aggregation and grouping of data. But while there are usually few concurrent transient queries 3 in a DBMS, there tends to be numerous concurrent and persistent queries in a CEP system because clients are interested in event patterns that happen over time. Another duality is that a DBMS is an active system because it pulls data from memory or disc, while a CEP system is passive as events are pushed. This subtle difference introduces a lot of new challenges that DBMS do not have to deal with. First of all, a CEP system can experience bursts of events that can deplete its CPU and memory resources because event streams are by their very nature stochastic. This forces the CEP system to make use of shedding [8]. Shedding is the process of selecting and discarding events to prevent breakdown of the system due to overload. Shedding techniques are a research topic in itself and will not be further investigated in this project. The fact that events are sent across a network also introduces challenges because of stream imperfections. Events may arrive out of order, be delayed or there may be published corrections to previous events (e.g. an update of published news). A few systems are designed to deal with stream imperfections (e.g. [2]) but to keep focus on the distribution of queries and CE detection, this area will not be further studied. The many concurrent queries (typically >10000 depending on the application) also make it essential to do multi query optimizations (MQO), which is also done in DBMS although it is not as crucial for the performance as in CEP systems. 
As the events arrive from sources that are dispersed across a network and have a stochastic arrival rate, a CEP system must continuously gather statistics to do re-optimization which involves a search for a (sub) optimal query plan. This is in contrast to databases where statistics are instantly available because statistics are recalculated whenever data changes. Another difference with respect to query optimization is the computational effort that a query optimizer can spend on finding the best plan in the two types of systems. In a database, where short response time (latency) is usually the optimization goal, the query optimizer constantly has to consider the tradeoff between spending time on finding a better plan and executing a suboptimal plan. In a CEP system, it will generally be feasible to spend more resources on finding an optimal query plan because the queries are continuous. Furthermore, a client of a CEP system would seldom require a query to be quickly started. In the above discussion, the notion of performance metrics has been implicit. The next section will look at the different metrics that could be considered in a cost model for queries in a CEP system Optimization Goals and Performance Metrics There could be several, often conflicting goals for optimizing a CEP System. Typically a client would be interested in receiving notification of a CE with low latency and high reliability (low probability of shedding). From the system s perspective, it would be advantageous to have a high throughput of events per time unit, high utility of the resources, low network usage and high readiness for new queries to be registered in the system. These considerations all point to the need for a unified cost model when trying to optimize a query. As different clients might have different requirements, this indicates that it would be interesting for the clients to give different weights to the performance metrics for different queries 4. 
Different existing cost models will be introduced along with algorithms for optimizing event detection in Section 3.3.

³ Queries end as soon as they are evaluated.
⁴ As done in e.g. [62].

Often the overall performance of a system running on a single computer is a compromise between different optimization goals. For instance, it is difficult to achieve high throughput and low latency (which both imply high resource utilization) when numerous queries are registered, and still have a
high readiness for new queries. An obvious way to overcome some of the limitations of having a detection system running on a single computer is to use a distributed detection system.

2.2 Active Databases

Ever since databases were invented, they have been the main data handling and storage component of most commercial information systems. As all interesting data in an application is usually made persistent in the database, this made it a natural place to detect events and take appropriate action. The actions of interest could be many: updating statistical information, monitoring changes of domain specific data and reacting to special conditions.

In databases, events and actions to be executed are specified with Event-Condition-Action (ECA) rules (SQL:1999). In this context an event is a change in the database that activates a trigger. This could be an insertion, an update or a deletion of a row. The condition is an SQL query without side effects that is interpreted as true if the result set is nonempty and otherwise false. If the condition evaluates to true then the action part of the rule is executed. The action can access the results of the query evaluated in the condition, execute new queries, use data definition commands (e.g. create new tables) and even execute procedural code [56].

One of the criticisms of active databases [56] is that the action of an ECA rule can activate other triggers, a so-called recursive trigger, which can make it difficult to understand the aggregate result of a trigger. Another issue regarding triggers is performance. The query in the condition of a rule is evaluated each time an event occurs, even if the event is irrelevant with respect to the condition part of the rule. This makes it very inefficient, depending on the size of the query, and motivates the need for complex event detection in active databases to allow more efficient trigger mechanisms.
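The ECA semantics described above can be mimicked with a small in-memory sketch. This is not SQL:1999 trigger syntax; the classes `EcaRule` and `Table` and the example rule are hypothetical and only illustrate that the condition is a side-effect-free query, that a non-empty result counts as true, and that the condition is re-evaluated on every matching event.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class EcaRule:
    event: str                                    # "insert", "update" or "delete"
    condition: Callable[[List[Row]], List[Row]]   # side-effect-free query over the table
    action: Callable[[List[Row]], None]           # receives the condition's result set


class Table:
    def __init__(self) -> None:
        self.rows: List[Row] = []
        self.rules: List[EcaRule] = []

    def insert(self, row: Row) -> None:
        self.rows.append(row)
        self._fire("insert")

    def _fire(self, event: str) -> None:
        for rule in self.rules:
            if rule.event != event:
                continue
            # The condition query runs for every matching event, relevant or not.
            result = rule.condition(self.rows)
            if result:                            # a non-empty result set means true
                rule.action(result)


alerts: List[List[Row]] = []
accounts = Table()
accounts.rules.append(EcaRule(
    event="insert",
    condition=lambda rows: [r for r in rows if r["balance"] < 0],
    action=alerts.append,
))
accounts.insert({"id": 1, "balance": 50})     # condition evaluated, result set empty
accounts.insert({"id": 2, "balance": -10})    # condition true, action executed
```

Note that the first insertion still pays for a full evaluation of the condition, which is the performance issue raised in the text.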
This section will introduce two important systems for detecting events in active databases, Snoop and SAMOS, and look at their event detection mechanisms.

Snoop

One of the first and most influential works on active databases was Snoop [11], a declarative event specification language that is model independent. The latter means that logical events (model independent) are distinguished from physical events (implementation specific). A logical event is defined as an atomic occurrence, which in other words means that the event either occurs completely or not at all. Associated with each event is the time of occurrence, t_occ. The temporal model of Snoop assumes an equi-distant discrete time domain, which is a time line with 0 as the origin and all time points represented as non-negative integers. Another assumption in the model is that only a single event can occur at a given time point.

Physical events are not necessarily atomic. For instance, a database insertion operation has a certain duration. To achieve a mapping between physical and logical events, Snoop has event modifiers that can be either begin-of, end-of or user defined (all three are in fact implementation specific). This allows any physical event with a duration to be mapped to two distinct logical events.

An event also has a type and parameters. The super types are primitive events and composite events. The former has several subtypes (database events, explicit events and temporal events). As part of the mapping from physical to logical events, the parameters of primitive events need to be instantiated upon detection; how this is achieved is implementation specific.

Snoop's Event Specification Language

To describe the events in an ECA rule, Snoop has the concept of event expressions. A simple event is an event expression. Composite events are recursively defined as an event expression with primitive or composite events as operands for a set of operators.

The different types of operators are:

- The disjunction operator is a binary operator. The composite event, E1 Or E2, occurs when either event E1 or event E2 or both⁵ occur.
- The sequence operator is also a binary operator and the composite event, E1 ; E2, occurs when the event E1 occurs followed by the event E2.
- The conjunction operators are the Any(m, E1,..., En) operator and the All(E1,..., En) operator. The Any operator occurs if m of E1,..., En occur. The All operator is just a shorthand for Any(n, E1,..., En).
- The aperiodic operators are the A(E1, E2, E3) operator and its variant A*. The aperiodic operator A can be used to express each occurrence of an event E2 in an interval bounded by events E1 and E3. The * version, A*, occurs only once if E2 occurs several times (once or more), and if so the parameters of E2 are aggregated using an aggregation operator (e.g. summation).
- The periodic operators are the P(E1, [t], E2) operator and its * version. P(E1, [t], E2) occurs periodically every [t] (a time specification) after E1 has occurred and before E2 occurs. Again the * version only occurs once for each interval with parameters⁶ accumulated.

Detection of Composite Events

Snoop also devises an algorithm for efficiently detecting composite events. Each event expression associated with a rule is translated into an operator tree. The leaves of the operator tree are primitive events, and the internal nodes represent composite events, which are equivalent to operators applied to subexpressions/subtrees. The edges between nodes are directed from the leaves towards the root to indicate the activation caused by an event. The algorithm works by first coalescing the operator trees of equivalent event expressions into a common event graph. When a primitive event occurs it first activates the leaf node, then instantiates its parameters and then activates all nodes attached to its outgoing edges.
Depending on the operator semantics of the activated nodes, new composite events are propagated in the graph along with aggregated or accumulated parameters. If a rule has a condition then this is evaluated as a side effect of the node activation and if true an action is executed. Common event subexpressions are only evaluated once for a given (composite) event. This technique is similar to other techniques found in [23, 38], which will be investigated in later sections.

Discussion of Snoop

In Snoop it is possible to express many different composite events. In [11] several possible rewrites of event expressions are explored that would be relevant when trying to find an optimal evaluation plan. But it is not possible to express infinite sequences of events such as E* (the Kleene Star [43, 44]). The temporal model also presents a problem in a distributed setting: it assumes that only a single event can occur at any given point in time. This is a reasonable assumption in a database system running on a single-core machine, but in a distributed system multiple events could occur at the same time, that is, multiple events could get the same discrete timestamp⁷.

⁵ This depends on whether concurrent events are possible in the implementation.
⁶ A parameter for t can be specified, e.g. the name of a stock whose price is sampled periodically.
⁷ There will always be some skew between the clocks in a distributed system, so equal timestamps do not imply the same physical time of occurrence [48].
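The graph-based detection described for Snoop can be sketched in a few lines. This is a simplified illustration and not Snoop's actual algorithm: only the disjunction and sequence operators are covered, parameters are reduced to the occurrence time, and the class names `Node`, `Primitive`, `Or` and `Seq` are assumptions made for the example. It does show the two key ideas: activation flows from the leaves towards the root, and a shared subexpression exists only once in the graph.

```python
from typing import Callable, List, Optional


class Node:
    """A node in the event graph; edges point from operands to their parents."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parents: List["Node"] = []
        self.listeners: List[Callable[[str, int], None]] = []

    def signal(self, t_occ: int) -> None:
        for listener in self.listeners:       # rules attached to this node
            listener(self.name, t_occ)
        for parent in self.parents:           # propagate towards the root
            parent.activate(self, t_occ)

    def activate(self, child: "Node", t_occ: int) -> None:
        raise NotImplementedError


class Primitive(Node):
    def occur(self, t_occ: int) -> None:
        self.signal(t_occ)


class Or(Node):
    def __init__(self, left: Node, right: Node) -> None:
        super().__init__(f"({left.name} Or {right.name})")
        left.parents.append(self)
        right.parents.append(self)

    def activate(self, child: Node, t_occ: int) -> None:
        self.signal(t_occ)                    # either operand is enough


class Seq(Node):
    def __init__(self, left: Node, right: Node) -> None:
        super().__init__(f"({left.name} ; {right.name})")
        self.left = left
        self.left_seen: Optional[int] = None  # time of the pending left operand
        left.parents.append(self)
        right.parents.append(self)

    def activate(self, child: Node, t_occ: int) -> None:
        if child is self.left:
            self.left_seen = t_occ
        elif self.left_seen is not None and t_occ > self.left_seen:
            self.left_seen = None
            self.signal(t_occ)


e1, e2, e3 = Primitive("E1"), Primitive("E2"), Primitive("E3")
shared = Or(e1, e2)            # common subexpression, a single node in the graph
pattern = Seq(shared, e3)      # (E1 Or E2) ; E3

detected = []
shared.listeners.append(lambda name, t: detected.append((name, t)))
pattern.listeners.append(lambda name, t: detected.append((name, t)))

e3.occur(1)   # ignored: the left operand has not occurred yet
e1.occur(2)   # signals (E1 Or E2) and arms the sequence
e3.occur(3)   # completes (E1 Or E2) ; E3
```

Both listeners are served by one evaluation of the `Or` node, which corresponds to evaluating a common subexpression only once per event.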

SAMOS

SAMOS is an active database prototype that uses an extension of Coloured Petri Nets [47] for composite event detection [37, 38]. The idea of using Petri Nets is to provide a more efficient way of detecting composite events than the naive approach, in which all occurred primitive events are examined for event patterns whenever a new primitive event occurs. This is achieved by allowing incremental detection, because the Petri Net can store the current state of the detection process and thus keep track of which events have already been detected in a composite event pattern.

In SAMOS an event is modelled as having a time of occurrence. There is a distinction between the occurrence of an event and the signalling of an event, i.e., the time of detection. Similar to Snoop's physical events, which have a duration and are mapped to a logical event, SAMOS defines an event to be either the beginning or the end of a database transaction, the creation time of an object, the return time of a method or simply an explicit time point. As in Snoop the events are parameterized, although there is only a small fixed set of parameters, e.g., the occurrence time, the user who started a transaction etc. All these parameters are database specific.

SAMOS's Event Specification Language

Event patterns can be recursively described using six different event constructors. These are:

- Disjunction: A disjunction of events, (E1|E2), occurs when either E1 or E2 (exclusive-or) occurs.
- Conjunction: A conjunction of events, (E1,E2), occurs if both events occur regardless of order.
- Sequence: A sequence of events, (E1;E2). The construct has the same semantics as in Snoop.
- History: A history of events, (TIMES(n, E) IN I), occurs when the event E occurs n times in interval I.
- Negation: A negative event, (NOT E IN I), occurs if E does not occur in interval I.
- Star (*): A *-constructor which can be applied to an event constructor E, such that multiple occurrences of the event are signaled only once.

The occurrence time of a composite event is equal to the occurrence time of the last event that occurs in the pattern (e.g. the last event of E1 and E2 in (E1,E2), the nth E in a history event or the end of the interval I in a negative event).

Detection of Composite Events

Composite events are detected by translating event patterns into SAMOS Petri Nets (S-PN), which are extensions of Coloured Petri Nets (C-PN). The reader may only be familiar with regular Petri Nets, but the main difference with respect to C-PN is that places have different types and there are arc-expressions on the arcs. Arc-expressions are functions that bind and manipulate the contents of tokens. Furthermore there are guards on the transitions that determine if a transition can fire. For a detailed introduction to Coloured Petri Nets, see [47].

S-PN extends C-PN by allowing tokens to carry complex information: the parameters of events. The idea is to let places represent event patterns and tokens be the actual events detected so far. Thus the marking of the S-PN is the current state of the event detection. The arc expressions are used to transform parameters of (potentially composite) events into the parameters of composite events. Three examples of event constructors are shown in Figure 2.3: the conjunction, disjunction and sequence of two events E1 and E2. Once a primitive event occurs, a token with the parameter values is instantiated on the corresponding place, and the token-game is played until no more firings are possible. If a token is located at a place that models a specified event pattern in a rule, the token is consumed and the action is fired.
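The token-game can be illustrated with a heavily simplified Petri net. This sketch is not the S-PN formalism from [37, 38]: it has no place types, guards or auxiliary places, it models only the conjunction constructor, and completed patterns simply remain on their output place instead of triggering a rule action. The names `Transition`, `PetriNet` and `union` are assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List

Token = Dict[str, object]   # the parameters carried by a (composite) event


class Transition:
    def __init__(self, inputs: List[str], output: str,
                 combine: Callable[[List[Token]], Token]) -> None:
        self.inputs = inputs      # names of the input places
        self.output = output      # name of the output place
        self.combine = combine    # plays the role of an arc-expression


class PetriNet:
    def __init__(self) -> None:
        self.marking: Dict[str, Deque[Token]] = defaultdict(deque)
        self.transitions: List[Transition] = []

    def add_token(self, place: str, token: Token) -> None:
        self.marking[place].append(token)
        self.play_token_game()

    def play_token_game(self) -> None:
        fired = True
        while fired:                          # repeat until no transition can fire
            fired = False
            for t in self.transitions:
                if all(self.marking[p] for p in t.inputs):
                    consumed = [self.marking[p].popleft() for p in t.inputs]
                    self.marking[t.output].append(t.combine(consumed))
                    fired = True


def union(tokens: List[Token]) -> Token:
    """Merge the parameters of the consumed tokens into one token."""
    merged: Token = {}
    for token in tokens:
        merged.update(token)
    return merged


net = PetriNet()
# Conjunction (E1,E2): fires once both input places hold a token.
net.transitions.append(Transition(["E1", "E2"], "(E1,E2)", union))
# A nested conjunction that reuses the result of the first pattern.
net.transitions.append(Transition(["(E1,E2)", "E3"], "((E1,E2),E3)", union))

net.add_token("E3", {"c": 3})
net.add_token("E1", {"a": 1})
net.add_token("E2", {"b": 2})   # triggers two firings in a row
```

The last call shows how a single primitive event can cause a chain of firings, with the marking holding the intermediate detection state between events.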

Figure 2.3: From top left to right, the S-PN of the three composite event constructors: conjunction (E1,E2), disjunction (E1|E2) and sequence (E1;E2). The function (x,y) computes the union of the parameters x and y. Note that in the S-PN for (E1;E2) the place H (with an initial token) prevents the transition t from firing until E1 has occurred.

All of these simple S-PNs are combined into a Combination S-PN that merges repeated patterns into a single S-PN. This makes the detection more efficient. The actual algorithm for playing the token-game is described in [37]. It includes a matrix for representing the arcs in the S-PN. Once a token (event) is added to a place, the algorithm iterates over the row that describes the input-arcs-to-transitions from the place and then tries to fire the transitions. If a transition can fire, the corresponding column describing the output-arcs-to-places is traversed and a new token is placed accordingly. This procedure is repeated until no transitions can fire.

Discussion of SAMOS

Using an augmented Coloured Petri Net as a formal model for describing and detecting event patterns is an elegant idea with well-defined semantics for the operators. But, as is apparent from e.g. the S-PN for the sequence operator, the nets quickly become very complex and large, particularly when the S-PNs of several operators are combined into one Combination S-PN. Another issue is that SAMOS does not allow augmentation of parameters. Furthermore the language is not very expressive; e.g. it is not possible to express an infinite sequence of events (i.e. the Kleene star).

The detection algorithm maintains a matrix for representing the arcs in the S-PN and iterates several times over rows and columns when playing the token-game. Clearly, this is infeasible and inefficient for a large S-PN. First of all the matrix describing the S-PN will usually be sparse and thus it would be better to use e.g. adjacency lists.
The algorithm for playing the token-game may also be slow and involve many calculations when a single event occurs, e.g. when a single event triggers a series of firings. As later approaches for detection of complex event patterns will show, there are more efficient detection mechanisms.

Other Systems

One of the simplest approaches to creating an active database, and the first to introduce the notion of continuous queries, is the Tapestry system [68]. In the Tapestry system, a user specifies a query in the Tapestry Query Language (TQL), which is converted into SQL and run periodically against an Append-Only database, in which events are inserted as they occur. The system is not very scalable due to the periodic query evaluations.

Another system for detecting events in an object-oriented database is the Ode Active Database [49]. In Ode the database, its manipulation, the queries and the triggers are described in a language, O++, which is an extension of C++. The specification of the event pattern part of a trigger has expressive power equivalent to regular expressions. The actual detection is also done by converting
event patterns into finite state machines. This allows for efficient detection. But Ode is not suitable for distribution for several reasons. First, O++ is an extension of C++ and the O++ code is compiled into C++ code, which makes it much more difficult to do query optimization because cost models of the operators will be very difficult (or even impossible) to derive. Second, Ode is very tightly integrated with object databases. This makes it unsuitable for general event detection from e.g. XML streams. But the detection mechanism based on automata is very interesting, as will be seen later.

Conclusion

The event detection in the presented languages and implementations of ECA rules for databases reflects early work in the field of detecting complex events. These systems are generally biased towards event detection in databases, which was their primary purpose, and this makes them less suitable as general purpose complex event specification and detection systems. Another issue is performance. The presented systems cannot be expected to be very scalable. As such the existing work in active databases provides few insights with respect to how to build a distributed system for event pattern detection with high performance.

2.3 Data Stream Management Systems

Data Stream Management Systems (DSMS) are a class of systems that manage data arriving from a series of streams. As events can be modelled as data arriving on streams, DSMS can be used to detect complex event patterns. Often DSMS are integrated with traditional Database Management Systems, which allows queries on both streams and data in databases (relations) to be evaluated in a homogeneous way. This and the fact that their query languages are very expressive make them very interesting. In general their languages (and systems) model events as tuples (as in databases) and are thus often extensions of SQL or relational algebra.
The extensions consist of constructs, such as windows, to deal with the temporal separation of events and the spatial separation of sources. In this section, two of the most interesting systems, STREAM and Borealis, will be presented in detail.

STREAM: The Stanford Data Stream Management System

STREAM is a DSMS developed at Stanford with a declarative language called the Continuous Query Language (CQL) [5, 6]. Some of the goals for CQL and STREAM were to build a general purpose DSMS while exploiting existing knowledge of the relational model. The latter would allow reuse of existing techniques for query rewrites, query optimizations and choice of execution strategies (e.g. operator scheduling) according to the current situation. It would also make it easier for users familiar with SQL to write CQL queries. These goals have to a large extent been reached [6].

Data Model

The data model of STREAM consists of streams and updatable relations, and each stream or updatable relation has a fixed schema (as in traditional databases). A stream S is modelled as a possibly infinite bag (multiset) of elements ⟨s, τ⟩ where s is a tuple with the schema of S and τ ∈ T is the timestamp of the element. The domain T is a discrete and ordered domain of timestamps. A relation R is defined to be a mapping from T to a finite but unbounded bag of tuples r belonging to the schema of R. Thus at a time point τ_i there could be a finite (even zero) number of tuples {⟨s, τ_i⟩ ∈ S} in a stream S. Similarly R(τ_i), denoting the tuples in R at the time τ_i, could be a finite (even zero) number of tuples {r ∈ R(τ_i)}.
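The data model above can be made concrete with a small sketch. This is an illustration of the definitions only and has nothing to do with how STREAM is implemented: a finite prefix of a stream is represented as a bag of ⟨s, τ⟩ pairs using `collections.Counter`, and a relation as a dictionary from timestamps to bags. The helper names `tuples_at` and `relation_at` and all example data are assumptions made for the example.

```python
from collections import Counter
from typing import Dict


def tuples_at(stream: Counter, tau: int) -> Counter:
    """Return the bag of tuples s such that <s, tau> is an element of the stream."""
    bag: Counter = Counter()
    for (row, timestamp), count in stream.items():
        if timestamp == tau:
            bag[row] += count
    return bag


def relation_at(relation: Dict[int, Counter], tau: int) -> Counter:
    """R(tau); an empty bag if nothing is recorded for tau (a simplification)."""
    return relation.get(tau, Counter())


# A finite prefix of a stream: a bag of <s, tau> elements.
quotes: Counter = Counter()
quotes[(("ACME", 101), 1)] += 1
quotes[(("ACME", 101), 1)] += 1     # a bag may contain duplicates
quotes[(("INIT", 27), 2)] += 1

# A relation: a mapping from timestamps to finite bags of tuples.
listings: Dict[int, Counter] = {
    1: Counter({("ACME", "NYSE"): 1}),
    2: Counter({("ACME", "NYSE"): 1, ("INIT", "LSE"): 1}),
}
```

Here `tuples_at(quotes, 1)` holds the same tuple twice, while `tuples_at(quotes, 3)` is an empty bag, matching the "finite (even zero)" case in the definition.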

Event-based middleware services

Event-based middleware services 3 Event-based middleware services The term event service has different definitions. In general, an event service connects producers of information and interested consumers. The service acquires events

More information

CLOUD BASED SEMANTIC EVENT PROCESSING FOR

CLOUD BASED SEMANTIC EVENT PROCESSING FOR CLOUD BASED SEMANTIC EVENT PROCESSING FOR MONITORING AND MANAGEMENT OF SUPPLY CHAINS A VLTN White Paper Dr. Bill Karakostas Bill.karakostas@vltn.be Executive Summary Supply chain visibility is essential

More information

Processing Flows of Information: From Data Stream to Complex Event Processing

Processing Flows of Information: From Data Stream to Complex Event Processing Processing Flows of Information: From Data Stream to Complex Event Processing GIANPAOLO CUGOLA and ALESSANDRO MARGARA Dip. di Elettronica e Informazione Politecnico di Milano, Italy A large number of distributed

More information

Database Replication with MySQL and PostgreSQL

Database Replication with MySQL and PostgreSQL Database Replication with MySQL and PostgreSQL Fabian Mauchle Software and Systems University of Applied Sciences Rapperswil, Switzerland www.hsr.ch/mse Abstract Databases are used very often in business

More information

2 Associating Facts with Time

2 Associating Facts with Time TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points

More information

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 Outline More Complex SQL Retrieval Queries

More information

Introduction to Service Oriented Architectures (SOA)

Introduction to Service Oriented Architectures (SOA) Introduction to Service Oriented Architectures (SOA) Responsible Institutions: ETHZ (Concept) ETHZ (Overall) ETHZ (Revision) http://www.eu-orchestra.org - Version from: 26.10.2007 1 Content 1. Introduction

More information

Middleware support for the Internet of Things

Middleware support for the Internet of Things Middleware support for the Internet of Things Karl Aberer, Manfred Hauswirth, Ali Salehi School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne,

More information

Programma della seconda parte del corso

Programma della seconda parte del corso Programma della seconda parte del corso Introduction Reliability Performance Risk Software Performance Engineering Layered Queueing Models Stochastic Petri Nets New trends in software modeling: Metamodeling,

More information

Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall.

Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall. Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall. 5401 Butler Street, Suite 200 Pittsburgh, PA 15201 +1 (412) 408 3167 www.metronomelabs.com

More information

Data Stream Management and Complex Event Processing in Esper. INF5100, Autumn 2010 Jarle Søberg

Data Stream Management and Complex Event Processing in Esper. INF5100, Autumn 2010 Jarle Søberg Data Stream Management and Complex Event Processing in Esper INF5100, Autumn 2010 Jarle Søberg Outline Overview of Esper DSMS and CEP concepts in Esper Examples taken from the documentation A lot of possibilities

More information

JoramMQ, a distributed MQTT broker for the Internet of Things

JoramMQ, a distributed MQTT broker for the Internet of Things JoramMQ, a distributed broker for the Internet of Things White paper and performance evaluation v1.2 September 214 mqtt.jorammq.com www.scalagent.com 1 1 Overview Message Queue Telemetry Transport () is

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Software Life-Cycle Management

Software Life-Cycle Management Ingo Arnold Department Computer Science University of Basel Theory Software Life-Cycle Management Architecture Styles Overview An Architecture Style expresses a fundamental structural organization schema

More information

Web Traffic Capture. 5401 Butler Street, Suite 200 Pittsburgh, PA 15201 +1 (412) 408 3167 www.metronomelabs.com

Web Traffic Capture. 5401 Butler Street, Suite 200 Pittsburgh, PA 15201 +1 (412) 408 3167 www.metronomelabs.com Web Traffic Capture Capture your web traffic, filtered and transformed, ready for your applications without web logs or page tags and keep all your data inside your firewall. 5401 Butler Street, Suite

More information

Getting Real Real Time Data Integration Patterns and Architectures

Getting Real Real Time Data Integration Patterns and Architectures Getting Real Real Time Data Integration Patterns and Architectures Nelson Petracek Senior Director, Enterprise Technology Architecture Informatica Digital Government Institute s Enterprise Architecture

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Testing & Assuring Mobile End User Experience Before Production. Neotys

Testing & Assuring Mobile End User Experience Before Production. Neotys Testing & Assuring Mobile End User Experience Before Production Neotys Agenda Introduction The challenges Best practices NeoLoad mobile capabilities Mobile devices are used more and more At Home In 2014,

More information

Semantic-ontological combination of Business Rules and Business Processes in IT Service Management

Semantic-ontological combination of Business Rules and Business Processes in IT Service Management Semantic-ontological combination of Business Rules and Business Processes in IT Service Management Alexander Sellner 1, Christopher Schwarz 1, Erwin Zinser 1 1 FH JOANNEUM University of Applied Sciences,

More information

The Synergy of SOA, Event-Driven Architecture (EDA), and Complex Event Processing (CEP)

The Synergy of SOA, Event-Driven Architecture (EDA), and Complex Event Processing (CEP) The Synergy of SOA, Event-Driven Architecture (EDA), and Complex Event Processing (CEP) Gerhard Bayer Senior Consultant International Systems Group, Inc. gbayer@isg-inc.com http://www.isg-inc.com Table

More information

Oracle Database 10g: Introduction to SQL

Oracle Database 10g: Introduction to SQL Oracle University Contact Us: 1.800.529.0165 Oracle Database 10g: Introduction to SQL Duration: 5 Days What you will learn This course offers students an introduction to Oracle Database 10g database technology.

More information

See the wood for the trees

See the wood for the trees See the wood for the trees Dr. Harald Schöning Head of Research The world is becoming digital socienty government economy Digital Society Digital Government Digital Enterprise 2 Data is Getting Bigger

More information

USING COMPLEX EVENT PROCESSING TO MANAGE PATTERNS IN DISTRIBUTION NETWORKS

USING COMPLEX EVENT PROCESSING TO MANAGE PATTERNS IN DISTRIBUTION NETWORKS USING COMPLEX EVENT PROCESSING TO MANAGE PATTERNS IN DISTRIBUTION NETWORKS Foued BAROUNI Eaton Canada FouedBarouni@eaton.com Bernard MOULIN Laval University Canada Bernard.Moulin@ift.ulaval.ca ABSTRACT

More information

Modern Databases. Database Systems Lecture 18 Natasha Alechina

Modern Databases. Database Systems Lecture 18 Natasha Alechina Modern Databases Database Systems Lecture 18 Natasha Alechina In This Lecture Distributed DBs Web-based DBs Object Oriented DBs Semistructured Data and XML Multimedia DBs For more information Connolly

More information

2. Basic Relational Data Model

2. Basic Relational Data Model 2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that

More information

SemCast: Semantic Multicast for Content-based Data Dissemination

SemCast: Semantic Multicast for Content-based Data Dissemination SemCast: Semantic Multicast for Content-based Data Dissemination Olga Papaemmanouil Brown University Uğur Çetintemel Brown University Wide Area Stream Dissemination Clients Data Sources Applications Network

More information

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Query optimization. DBMS Architecture. Query optimizer. Query optimizer.

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Query optimization. DBMS Architecture. Query optimizer. Query optimizer. DBMS Architecture INSTRUCTION OPTIMIZER Database Management Systems MANAGEMENT OF ACCESS METHODS BUFFER MANAGER CONCURRENCY CONTROL RELIABILITY MANAGEMENT Index Files Data Files System Catalog BASE It

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Temporal Database System

Temporal Database System Temporal Database System Jaymin Patel MEng Individual Project 18 June 2003 Department of Computing, Imperial College, University of London Supervisor: Peter McBrien Second Marker: Ian Phillips Abstract

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Instant SQL Programming

Instant SQL Programming Instant SQL Programming Joe Celko Wrox Press Ltd. INSTANT Table of Contents Introduction 1 What Can SQL Do for Me? 2 Who Should Use This Book? 2 How To Use This Book 3 What You Should Know 3 Conventions

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

Glossary of Object Oriented Terms

Glossary of Object Oriented Terms Appendix E Glossary of Object Oriented Terms abstract class: A class primarily intended to define an instance, but can not be instantiated without additional methods. abstract data type: An abstraction

More information

i. Node Y Represented by a block or part. SysML::Block,
