Scalability and Design Patterns


1. Introduction
2. Dimensional Data Computation Pattern
3. Data Ingestion Pattern
4. DataTorrent Knobs for Scalability
5. Scalability Checklist
6. Conclusion

DataTorrent Inc.

1. Introduction

The DataTorrent platform enables massive scalability through its ability to distribute resource usage (CPU, memory, I/O, buffers, and so on) across a cluster. In this section we will look at what scalability means and how to leverage the platform to achieve it.

The DataTorrent platform enables real time processing for the big data space, which means that the application space is big data by definition. An application is considered a big data application if the resources required to complete it on time exceed one server (node), i.e. the SLA cannot be met by the resources available in a single node. The need for multiple nodes may be driven by CPU, memory, I/O, or disk; most often a big data application needs more nodes for all of these resources.

In later chapters we will walk through a few design patterns to illustrate scalability. We will illustrate various features of the DataTorrent platform that enable the design of a scalable application, and walk through the thought process for designing an application to take advantage of the platform. The code shown is pseudo code.

2. Dimensional Data Computation Pattern

A dimensional data application consists of ingesting events, computing most if not all of the dimensional combinations, and then computing actions based on the business logic. A dimensional data computation application has the following steps:

1. Events flow in from the front end/sensors, loosely termed event sources
2. A single event is used to generate all relevant dimensional combinations (if not all)
3. Aggregation is done on these dimensional combinations as per the requirements of the model
4. Models are applied to each of these dimensional combinations per application window
5. The results are stored on a periodic basis, where the periods are integral multiples of the application window. The storage may be to a database, appended to a file, or emitted to a message bus

There are more variations of this pattern, but for the sake of illustrating scalable design we will use the above specific pattern. In this example we will assume that events are pushed via a message bus. As we port a generic dimensional data computation code to DataTorrent, we will make successive revisions to the design to enable scalability and make it a big data application.

Let's start by writing this code outside the DataTorrent platform. For the sake of simplicity we assume that the incoming message can be converted into a string.

    // Pre-DataTorrent application code
    void onMessage(Message m) {
        String msg = m.getString();
        dimensions = generateDimensions(msg);
        values = getValues(msg);
        for (i = 0; i < dimensions.size(); i++) {
            updateModels(compute(aggregate(dimensions(i))));
        }
        if (time_since_last_update >= application_window) {
            storeResults(); // to database, or files, or message bus
            flushModels();  // flush internal cache
        }
    }

We see an immediate problem with the above code: if an event does not arrive exactly at an application window boundary, the updates to storage will not be aligned. Users would need to write a multi-threaded solution to enforce exact time boundaries. Moreover, this application is not scalable, and certainly not big data, as it is designed to run on a single compute node.

In revision 1 we will use a very basic building block of the DataTorrent platform that relieves the application developer of the multi-threading issue and enables the application to go beyond one node. The application will consist of an input adapter that takes events from the onMessage() call and emits each as a String event. The input adapter scales by having multiple partitions, each listening to a channel of the message bus. The scalability of an input adapter is tightly aligned with the system that ingests the data (in this example the message bus); for this discussion we will assume that system is built to add partitions as needed.

We will dive deeper into the computations needed to process each event. The processing of each event will be done in the process() call, and the compute results will be emitted/stored in endWindow(). Since the compute operator has an application window, we will use the Operator API to manage it. The DataTorrent platform lets users write the same code irrespective of the value of the application window; the application window is an attribute set at launch time via a configuration file, and in the future it will be settable at run time too.
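Step 2 above, generating all dimensional combinations from a single event, is an enumeration of the power set of the dimension keys, i.e. 2^N combinations for N dimensions (a cost that resurfaces later in this chapter). Here is a platform-independent sketch in plain Java; the class and method names are illustrative, not DataTorrent API:

```java
import java.util.ArrayList;
import java.util.List;

public class DimensionCombinations {
    // Enumerate all 2^N subsets of the dimension keys.
    // Combination i contains key j iff bit j of i is set.
    public static List<List<String>> generateDimensions(List<String> keys) {
        List<List<String>> combos = new ArrayList<>();
        int n = keys.size();
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> combo = new ArrayList<>();
            for (int j = 0; j < n; j++) {
                if ((mask & (1 << j)) != 0) {
                    combo.add(keys.get(j));
                }
            }
            combos.add(combo);
        }
        return combos;
    }
}
```

The exponential count is why the number of tuples grows so sharply between ingestion and aggregation in the revisions that follow.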
In the coding example below we have skipped the input adapter, as it is a standard operator that simply converts incoming messages into Strings and emits them to the next operator. Here is our first DataTorrent application.

    // First DataTorrent application (See Figure 1)
    public class SingleComputeOperatorApplication extends BaseKeyOperator<String>
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<String> data = new DefaultInputPort<String>(this)
        {
            public void process(String tuple)
            {
                // Process each tuple, and add it to the models
                dimensions = generateDimensions(tuple);
                values = getValues(tuple);
                for (i = 0; i < dimensions.size(); i++) {
                    updateModels(compute(aggregate(dimensions(i))));
                }
            }
        };

        public void endWindow()
        {
            storeResults();
            flushModels();
        }
    }

This design is much more modular, as the application period, the thread associated with it, and so on are taken care of by the platform. The platform's guarantee of single-threaded execution of process() and endWindow() ensures that the code works smoothly as-is; the application code is now a single-threaded execution. Further revision is now reduced to deciding if, how, and where to split computations to enable scale. Do note that migrating some other application may require a complete redesign of its computational architecture.

The platform also ensures that periodic updates/emits happen as needed, by calling endWindow() at the end of each application window. The generateDimensions() call can be used to generate time as a dimension and thus guarantee results precise to system clock boundaries. Users can also allow a lag time for late events within a time window; for example, an application window of 5 minutes can allow 1 more minute for lagging events to catch up. For simplicity we will assume all events arrive in the correct time bucket; this document focuses on the scalability aspect of the design, not on managing time series. To catch up lagging events the application has to treat time as a dimension, which also translates to the platform's ability to scale.

Figure 1 shows our design revision as a DataTorrent application. We now have two operators: one for ingestion, and one for all the computations of the application. Now that we have moved our code into the platform, we face our first set of scalability issues:

1. What happens if my events grow and one node is not enough?
2. What happens if the number of dimensional combinations is too large and creates a resource bottleneck? This could be CPU, memory, or I/O:
   a. CPU is usually proportional to the ingestion rate and the compute model
   b. Memory is proportional to the size of the state (keys) of the application, and therefore also dependent on the application window. Memory is also proportional to the algorithms used.
   c. I/O is proportional to both the method of segmenting the computations into various containers, and the final result outputted.
3. What happens if the output adapter is not able to write results through one process? Can the write-out be made parallel?

There are many more aspects of scalability at the top level, and the internals of resource utilization are dramatically impacted by them. In this revision we address the issues listed above; we will address other aspects of scalability in later revisions. The above three cases are the definition of big data: the resources needed to process the data require more than one server. Our next task is to figure out how a big data processing application scales.

To address scalability at ingestion (Issue 1), we need to partition the input adapter into I partitions. Each of these partitions will read a different channel or topic from the message bus. The input adapter thus scales via the classical load-balance approach, i.e. the number of partitions is proportional to the load; if the application needs more ingestion, it adds more partitions. The platform allows dynamic addition of partitions, enabling ingestion resources to be added at run time. Each of these partitions will emit strings.

As these strings are expanded into dimension combinations, i.e. a flat key structure, the number of events generated will grow. More operators are needed to handle the generation of dimensional combinations and their aggregates. The affinity of this operator is very close to the input adapter, hence a parallel partition is ideal. The number of partitions is decided by two needs: firstly, to be able to ingest all the incoming events; secondly, to be able to compute all the dimensional combinations and their aggregates.
The larger of these two numbers decides the number of partitions. The choice of threadlocal, processlocal, nodelocal, or racklocal on the stream is, however, decided by application needs. The trade-off is event throughput vs. operator isolation: threadlocal has the highest event throughput and the lowest operator isolation. We will do a detailed revision of Issue 2 later.

Let's look at Issue 3 now, i.e. how do we scale the write/update of results from the application to an outside system? In the case of a database, a bulk load will be used, as it aids atomicity, but it is advisable to do this in pieces (i.e. partitions). In the case of writing to a distributed file system (say HDFS), the outbound bandwidth needs multiple partitions to scale (by definition), and they will write to part files. In the case of a message bus, it makes sense to have multiple partitions emitting results, sometimes to different channels/topics. Let's assume we have O partitions of the outbound operator, where O is the optimal number for outbound updates.

Our design is now reflected in Figure 2. In order to fit the operator names in the figure, we give them short names:

op1: Input Adapter
op2: Dimensional Combination Generator
op3: Aggregation
op4: Compute Model
op5: Update External System

op1, op2, and op3 are connected using parallel partition, i.e. they share the same number of partitions, and data is unified after op3. A unifier is an in-built feature that combines stream partitions into a single stream as per operator logic. Unifiers can be used for NxM partitioning; the number of unifiers will be M, as we have M physical streams, one for each partition of the downstream operator (in this case op4).
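Conceptually, a unifier on the op3 to op4 stream merges the per-partition partial aggregates into one aggregate per key, exactly what a single downstream partition would see from a single upstream. A plain-Java sketch of a sum unifier (illustrative code, not the platform's Unifier interface):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SumUnifier {
    // Merge partial per-key sums coming from M upstream partitions
    // into the single view seen by one downstream partition.
    public static Map<String, Long> unify(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                // Sum values for keys that appear in multiple partitions
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return merged;
    }
}
```

Because the merge is associative, the same logic works whether one unifier sees all partitions or (as with cascading unifiers later) only a subset at a time.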

The next table shows how the above application looks in code. The input adapter and the parallel partition are DAG properties and are not shown in the table. The table consists of a breakdown of the basic code pattern; it is assumed that stitching these operators into a DAG and assigning attributes to the application are done separately.

    // A properly partitioned DataTorrent application (See Figure 2)

    // op2 is the Dimension Combination Generator
    public class op2 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<String> data = new DefaultInputPort<String>(this)
        {
            public void process(String tuple)
            {
                // Process each tuple, and emit its dimensional combinations
                dimensions = generateDimensions(tuple);
                values = getValues(tuple);
                for (i = 0; i < dimensions.count(); i++) {
                    emit(KeyValPair(dimension(i), values(i)));
                }
            }
        };
    }

    // op3 is the Aggregator
    public class op3 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<KeyValPair> data = new DefaultInputPort<KeyValPair>(this)
        {
            public void process(KeyValPair tuple)
            {
                // Look up in a hash, and aggregate. The code may add, find
                // max, min, average, etc.
                aggregate(tuple);
            }
        };

        public void endWindow()
        {
            // Emit the aggregate for each key
            emitAggregates();
        }
    }

    // op4 is the compute model, i.e. the basic business logic;
    // it works on aggregates computed in an application window
    public class op4 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<dimension_value> data = new DefaultInputPort<dimension_value>(this)
        {
            public void process(dimension_value aggregate)
            {
                updateModels(aggregate);
            }
        };

        public void endWindow()
        {
            emitResults();
            flushModels();
        }
    }

    // op5 outputs data to an external system
    public class op5 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "results")
        public final transient DefaultInputPort<result_schema> result = new DefaultInputPort<result_schema>(this)
        {
            public void process(result_schema ret)
            {
                bufferResult(ret);
            }
        };

        public void endWindow()
        {
            updateExternalSystem();
            flushState();
        }
    }

Revision 2 has made strong progress towards scalability. We now see immediate scalability bottlenecks that may show up from the following computations. The number of key combinations for N dimensions is 2^N. This can grow exponentially, and directly causes the stream between the Dimension Combination Generator and the Compute Model to be I/O bound and/or CPU bound. The fan-in to the Compute Model (op4) will grow and create a bottleneck. The compute model cost is usually proportional to the state (number of keys) and the kind of computations needed. This means the following two problems need to be solved:

- The ingestion is load balanced, but the compute model operator(s) are balanced by state and computations, which most likely means they are sticky-key balanced.
- The number of partitions in the operator that updates the external system is constrained by the affinity and limits of the external system.

The above two issues are solved by ensuring that the compute model has N partitions, and that NxM partitioning (with unifiers) is used between each of the phases.
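Sticky-key balancing means the partition for a tuple is a pure function of its key, so every update to a given key lands on the same physical operator and its state stays local. A minimal sketch follows; hash-mod is one common choice of key-to-partition mapping, and the platform's actual stream codec may differ:

```java
public class StickyKeyPartitioner {
    // Map a key to one of n partitions. The same key always maps to the
    // same partition, so per-key state never has to move between operators.
    public static int partitionFor(String key, int n) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), n);
    }
}
```

The cost of this determinism is skew: if one key dominates the traffic, its partition carries the load alone, which is the issue addressed by unifiers and runtime skew balancing later in the document.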

We now have a generic blueprint of the application's logical structure that allows the application to scale. For a scalable application to work within the platform, the following needs to be taken care of by the application developer:

- The platform cannot break apart a for or a while loop in your code. The breakdown of computations into individual operators has to be done by the application developer. Each leaf-level operator is a single compute unit, and a good design needs the application to be developed with optimally separated compute units.
- Migration of a single-node application is not just migrating the code; it is often a redesign, as the assumptions made in a single-node design frequently do not hold true.
- The platform checkpoints each physical operator in its entirety, so a physical operator with a huge state will block processing until checkpointing is done. Partitioning based on state size enables scale. Dynamic scalability also involves checkpoint loading, so checkpoint state impacts both scale and fault tolerance.
- The design should have minimal bottlenecks, which means the design should have a logical structure where partitioning reduces resource needs per partition in proportion to the number of partitions. The platform is designed to enable such an outcome for a vast majority of design patterns: it can distribute all resources (CPU, memory, I/O) and operations (computations, checkpointing, event/data passing) in a scalable manner, and we will strive to add more such features as needed.
- The application should be designed to scale linearly. The platform has a very linear scaling model that handles hundreds of millions of tuples. A common example of possible uneven distribution is key-based partitioning, which is susceptible to skew. The platform does a good job of handling skews, including the ability to load balance and unify for a large skewed partition.
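The skew risk of key-based partitioning can be made concrete by counting tuples per partition: a hot key overloads one partition while the others idle, which is exactly what load-balanced partitioning plus a unifier avoids. A small illustrative measurement in plain Java (the skew-ratio metric is an assumption for illustration, not a platform API):

```java
import java.util.List;

public class SkewCheck {
    // Tuples per partition under sticky-key (hash-mod) partitioning.
    public static int[] loadPerPartition(List<String> keys, int n) {
        int[] load = new int[n];
        for (String k : keys) {
            load[Math.floorMod(k.hashCode(), n)]++;
        }
        return load;
    }

    // Skew ratio: heaviest partition's load over the ideal even share.
    // 1.0 means perfectly balanced; n means one partition does all the work.
    public static double skewRatio(int[] load) {
        int max = 0, total = 0;
        for (int l : load) {
            max = Math.max(max, l);
            total += l;
        }
        return total == 0 ? 0 : (double) max * load.length / total;
    }
}
```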
More features will be added to make skew management as smooth as possible in the future. However, designs that are easily susceptible to key skew should be solved by designing to manage skew; unifiers are a great way to migrate a key-based partition to a load-balanced partition.

Here is what the platform can do:

- Massively scale, almost linearly, as the bookkeeping is very light; the bookkeeper does not interfere with data flow
- Distribute resources and operations across a commodity cluster, including CPU, memory, I/O, and checkpointing
- Dynamically scale by adding or deleting partitions
- Dynamically enable adding or modifying functionality
- Measure system parameters in real time and take action
- Separate operability from business logic, enabling users to focus on business logic
- Enable users to design the application such that scalability is completely an operational issue

We now have a building-block application that can be leveraged to put together a worldwide real time platform. At the top level this is a deployment structure for an enterprise with a worldwide customer base; the number of customer data centers, or clusters supported, can be in the hundreds or thousands. The reference architecture below scales into a massive worldwide footprint.


3. Data Ingestion Pattern

Data ingestion is a common pattern in the Hadoop ecosystem. A big part of the Hadoop job workload consists of running MapReduce applications on files copied from outside Hadoop (usually the front end), with the results copied back out for consumption by other MapReduce jobs. This ingestion causes big delays in terms of copying, and in needing to wait until all files are copied. The DataTorrent platform can significantly help with this pattern. It aims to achieve the following:

- Data can be ingested as soon as it is generated, removing the latency of file creation and copy
- Jobs can start as soon as possible and process events as they come in
- Data can be filtered, cleansed, and stored in real time for future use

In chapter 2 we covered scaling the input adapter briefly. In this chapter we will analyse in more depth ways to dynamically scale input adapters. In a data ingestion application it is very common to use a message bus. In discussing the data ingestion pattern we will not use pseudo code. Figure 5 shows how a preliminary design looks.

We see immediate issues with the above design: there is a single input adapter, and the HDFS adapter is not partitioned and hence will be limited by one HDFS write. Leveraging the thought process of Chapter 2, we can make the following improvements in revision 2:

- Add partitions to the input adapter
- Connect the input adapter and the filter operator via parallel partition
- Add partitions to the HDFS adapter
- Collate data to decide which data is streamed to which HDFS adapter, i.e. partition the HDFS adapter via sticky key. This key could be time, server name, etc.
- Partition collation as per needs

The above changes enable a much more scalable design, as seen in Figure 6. Figure 6 is a good design, however it lacks dynamic scalability on the input adapter operator. There is one filter operator per input adapter, connected via parallel partition, so filtering is handled by the input adapter's partitioning. The collate operator will dynamically scale as needed. The HDFS adapter also needs special care: if one partition is not able to handle the write throughput of the key(s) assigned to it, the adapter will need to partition that set of key(s), or write to part files if it is a single key. A generic partitioning will not work, as the files written to HDFS need a proper naming scheme. Unlike the dimensional computation model, the storage operation is bottlenecked by throughput, as every incoming event eventually generates one write to HDFS.
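The "proper naming scheme" constraint above can be met by deriving the HDFS file name from the sticky key, the writing partition, and a part index, so that a key which outgrows one file spills into numbered part files. A naming sketch (the scheme itself is an assumption for illustration):

```java
public class PartFileNamer {
    // Build a deterministic part-file path from the sticky key, the
    // physical partition id, and a rolling part number, e.g.
    // key "webserver-7", partition 2, part 0 -> "webserver-7/p2-part-000".
    public static String fileName(String key, int partition, int part) {
        return String.format("%s/p%d-part-%03d", key, partition, part);
    }
}
```

Because the name is a pure function of (key, partition, part), two writers never collide on a path, and a restarted partition can reconstruct where it left off.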

For dynamic addition of more input adapters, the message bus and its controller will need to enable the following at run time:

- Add more channels, topics, etc.
- Access the DataTorrent application to create more partitions, and register each partition with a channel (or another mechanism) for that partition to receive data
- Save state outside the application to enable stateful recovery (we will use ZooKeeper in this example): the current properties of each input adapter partition, and the state of data ingestion
- Give the HDFS adapter built-in file-name logic to handle more load. Data will be written in part files if need be; the DataTorrent platform can be leveraged by using partitions to determine the part file names

Figure 7 shows the final design. An application that ingests big data needs a lot of architectural planning if scalability is to be attained. We can further improve on Figure 7, as the Ingestion Controller is a single point of failure: we can move it into Hadoop as a DataTorrent application.
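The stateful-recovery requirement reduces to a small save/recover contract against an external store. In the sketch below a plain in-memory map stands in for ZooKeeper, and the partition-id/offset names are hypothetical; the point is only the shape of the contract:

```java
import java.util.HashMap;
import java.util.Map;

public class AdapterStateStore {
    // Stands in for an external coordination service such as ZooKeeper:
    // partition id -> last ingested offset, kept outside the application
    // so it survives an application restart.
    private final Map<String, Long> offsets = new HashMap<>();

    public void save(String partitionId, long offset) {
        offsets.put(partitionId, offset);
    }

    // On recovery, a restarted or newly added partition resumes from its
    // saved offset, or from 0 if it has never ingested anything.
    public long recover(String partitionId) {
        return offsets.getOrDefault(partitionId, 0L);
    }
}
```

With this split, adding an input adapter partition at run time is just a save() of its initial properties followed by normal operation; losing one is a recover() on its replacement.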

4. DataTorrent Knobs for Scalability

The DataTorrent platform has built-in innovations geared for scalability at big data scale; the platform was built to address scalability at that level. The following is a brief list of such features:

- Operators are single-thread executions and pay negligible context-switching cost. The model ensures that code is clean and not cluttered with thread management.
- Clean separation of business logic and operability enables the ability to scale massively. The architecture is designed to enable massive scale in a linear fashion, since the operability logic is completely within the platform.
- Thread contention between operators is negligible due to in-built platform features. All event buffers are handled by the platform, and the operator API is executed in a single thread.
- All streams are single-write, multiple-read configurations. This enables scale, as there is no write contention to manage.
- Many checks and balances are built in at compile time and launch time to keep the runtime scalable. These include schema checks at compile time, thereby removing any schema checks at run time.
- Operations are designed to execute in as unblocked a way as possible. Upstream operators continue to process events even if downstream operators are down; all containers and their components continue to function unless their upstream operators are down.
- Execution is distributed across the cluster and leverages resources as needed. The scale-out is the same as map-reduce: operators, streams, and buffers are partitioned as needed and launched in containers across the grid. All resources are covered: CPU, memory, network I/O, buffers, checkpoints.
- Ability to scale linearly as much as possible by design.
- Stream locality: thread local, process local, node local, rack local. This feature enables users to align operators (i.e. function calls) in the most optimal way. ThreadLocal uses the call stack to pass events; the event needs no copy or buffering. ProcessLocal has a buffer in between, but the tuple does not need serialization and thus scales better. NodeLocal uses the local loopback to save network interface card resources.
- Bufferserver: the event stream between a writer and its reader(s) is managed by the bufferserver. By disjointing the write from the read, the write operator can continue to process events without any blocking by the read operator(s). The buffer server also removes the dependency between two read operations. This separation is critical to the scaling of the application.
- NxM partitions: supported natively, thereby allowing each operator to partition as per its own need.
- Parallel partitions: enable a series of operators to be partitioned in unison. This allows a set of operators to leverage various stream localities, as well as do away with NxM partitions in between. This partitioning scheme simplifies the application significantly.
- Unifiers: allow users to use load-balancing partitioning even when the computation is key based. This is a major gain, as skew is completely removed by avoiding sticky-key partitioning. The unifier runs in the upstream container and thus provides a clean interface for the downstream operator.
- Cascading unifiers: very useful when a unifier has a huge fan-in. A cascading unifier puts a limit on the fan-in at each level and enables the application to scale.
- Partitionable unifiers: extremely useful when the unifier needs to integrate state from all the partitions; a unique-count operator is an example. A partitionable unifier allows streams to be partitioned as per the values of the key, thereby allowing scalable operators for functions such as unique counts.
- Ability to insert split and merge operators: a massive fan-out of read operations can be cascaded via a tree structure, with the fan-out controlled at each level. Single-write streams also ensure that there is no write contention at any level of the tree.
- Runtime skew balancing: skew in sticky-key partitions can be managed at run time, enabling applications to adapt to runtime skew. For example, if an internet site has a USA audience during the day and a China audience during the night, the skew in processing will be balanced in both cases. Scale is thus achieved by removing skew as a bottleneck.
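The cascading-unifier idea above can be sketched for partial sums: each level merges at most F inputs, so no single unifier ever sees all N partition streams. A plain-Java sketch with a hypothetical fan-in limit (illustrative, not the platform's unifier API):

```java
import java.util.ArrayList;
import java.util.List;

public class CascadingUnifier {
    // Repeatedly merge groups of at most fanIn partial sums until one
    // value remains; each inner group models one unifier instance.
    public static long unify(List<Long> partials, int fanIn) {
        List<Long> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += fanIn) {
                long sum = 0;
                for (int j = i; j < Math.min(i + fanIn, level.size()); j++) {
                    sum += level.get(j);
                }
                next.add(sum); // one unifier per group of <= fanIn inputs
            }
            level = next;
        }
        return level.isEmpty() ? 0 : level.get(0);
    }
}
```

Correctness relies only on the merge being associative, which is why the same trick applies to sums, maxima, counts, and similar aggregates.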

5. Scalability Checklist

An application developer needs to adapt their thought process and assumptions to ensure scalability of their big data application. A list of possible questions from a scalability point of view follows.

Questions for every operator:
- If the computation in the operator requires more CPU, will the operator partition the CPU requirement linearly?
- If the inbound or outbound I/O grows, will the operator partition I/O linearly?
- If the state of the operator grows, will the operator be able to partition state linearly?
- Are the computations being done by the most optimal leaf-level computations? Does the operator need to be split (or merged)?
- Is the operator able to leverage all the needed knobs of the platform?
- Will the operator scale linearly if an application developer sets an arbitrarily large window?
- Is the state of the operator optimized for scale? Does it block stream processing for too long?
- Is there a bottleneck in the operator computation? Can another design unblock it?

Questions for stream design:
- Does the stream have optimal locality?
- Are the fan-out and fan-in features correctly being leveraged?
- Should this stream exist? Would it be better to merge the operators?
- Is the schema of the tuple correct? Will it enable or impede scale?

Questions for application development:
- Are there assumptions that were true at smaller scale that need to be discarded when designing a big data application?
- Assuming the application has access to infinite resources, what would the ideal design be?
- Are the computations being done by the most optimal leaf-level computations? Should the application be redesigned?
- Where are the architectural bottlenecks if throughput, computations, etc. increase?
- Is the application bottlenecked if an outbound connection is slow?
- Is there resource contention? For example, are all partitions accessing the same database?
- Is the design optimally leveraging the resource boundaries (container size, cores per container, network I/O)? For example, is one container using all the memory while another is using minimal memory?

Common scalable patterns in customer use cases:

1. CPU partitioning can be scaled if the compute operation can be broken down into computations that can be done on partial data and then aggregated. Common examples include word count, and filtering during ingestion. Word count leverages sticky-key partitioning to ensure that each partition counts the occurrences of a subset of keys. Aggregations can leverage load-balancing partitioning to distribute computations, and use a unifier for the final aggregation.
2. The inbound I/O of an application can be scaled linearly if the computations done as part of ingestion can be partitioned. Parallel partitions are usually very useful in such use cases; the filter operator of the ingestion pattern in Chapter 3 is an example.
3. The outbound I/O of an application scales by distribution if the external system can ingest all the I/O. For example, if an output adapter writes to MongoDB, then the MongoDB instance should be able to handle writes from all the partitions.
4. Internal I/O scales if the downstream operator can be partitioned.
5. If the state of the operator grows with data and the logical operator needs access to the entire state, it creates a bottleneck for scalability; such an application will need a redesign. Unifiers provide a very powerful way to combine load balancing (skew-less partitioning) and yet access aggregates at the end of the window. An example is tracking 2nd-level connections in a connected graph. If the design has all the connections in the same operator, the work of getting the first-level connections and then their next connections grows with the size of the graph. A scalable design with more optimal leaf-level operators is as follows:
   a. Operator1: emit all the 1st-level connections, as id:1. This computation scales with sticky-key load balancing.
   b. Operator2: emit all next-level connections of these, as id:2. This operator also scales with sticky-key load balancing.
   c. Operator3: receive the outputs of Operator1 and Operator2, and drop all id:2 entries for an id that also arrived as id:1. Emit the final result at the end of the window. This operator also scales with sticky-key load balancing.
6. If two adjacent operators in an application have a massive throughput, leveraging THREAD_LOCAL or PROCESS_LOCAL should be considered. A very common example is generating dimensional keys and aggregating them; this use case can leverage stream locality along with parallel partition.
7. A possible bottleneck when processing millions of events/sec is that the application may need a lot of partitions. If the partitioning is Nx1, then all these partitions have to feed into a single downstream physical operator, and the fan-in may create a bottleneck. A cascading unifier solves this problem by unifying with a fixed fan-in at each level.
8. It is common to have the result of a computation be a HashMap, say HashMap<String, Integer>, with a single tuple of that schema emitted at the end of the window. As the data grows, so does the tuple; and it being a single tuple means the downstream operator has to consume it in a single physical operator.
One way out is to partition the current operator, with each physical partition sending out a HashMap. This pattern has problems with scalability, as the operator will need to ensure that a key exists in only one of the hashmaps, which forces the operator to take its input stream with sticky-key partitioning. A more scalable way is to get the operator to emit KeyValPair<String, Integer> tuples. The atomicity of the tuples is now moved to a KeyValPair, and the downstream operator can partition and consume the same data as per load, with no upper limit on the partitioning. This problem can only be solved by changing the schema: even if we insert an operator that consumes the HashMap and immediately splits it into KeyValPairs, that split operator still has to take in one single large HashMap tuple.

9. Checkpointing is done by serializing the operator object to disk, at the end of a window. While checkpointing is being done, stream processing is paused; the time taken to serialize thus reduces the number of events that can be processed. It is advisable to reduce the size of the object by partitioning based on object size, by using stateless operators as much as possible, or by declaring variables that are not part of the state as transient. For example, if an operator aggregates and emits the result at the end of every window, the operator is in effect stateless; such an operator should ensure that all its variables are transient.
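Item 9's advice about transient variables can be seen directly with plain Java serialization, used here as a stand-in for the platform's checkpointing: transient fields are skipped on write, so the checkpoint carries only real state. A sketch (the field names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CheckpointDemo implements Serializable {
    public long windowSum = 7;                        // real state: checkpointed
    public transient int[] scratch = new int[1000];   // derived cache: skipped

    // Serialize and deserialize, as a checkpoint-restore round trip would.
    public static CheckpointDemo roundTrip(CheckpointDemo op) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(op);
            oos.flush();
            ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return (CheckpointDemo) ois.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

After the round trip the transient field comes back as its default value (null here), which is exactly why derived caches belong in transient fields: they shrink the checkpoint and are rebuilt on recovery.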

6. Conclusion

The DataTorrent platform is designed to enable applications to scale and process big data in real time. The platform follows the fundamental tenet of achieving scalability by distributing resource utilization across a commodity cluster. Scalability, however, cannot be bolted onto a design; it has to be baked in, and an application must be scalable by construction. The platform provides various knobs that users can avail of, but applications must be designed to use them; doing so is one of the most crucial skills needed to design scalable DataTorrent applications. When porting existing applications, care should be taken to remove assumptions baked into the design that would block the application from taking advantage of the platform's features, since designs carry constructs that derive from their basic assumptions.

DataTorrent is a big data platform that enables massively scalable real-time computation in near-linear fashion. This requires a new thought process and the questioning of old assumptions that no longer hold true; big data scalability is a new paradigm. Application developers, as well as operator developers, must ask: "If I have massive compute resources at my disposal, how should my design look?" Application developers also need to understand how an application behaves when numerous real-time tasks run on nodes throughout the cluster: how the data flows, how windows sync up, and how the whole application runs in parallel. DataTorrent aims to enable an ecosystem that derives from the commoditization of compute resources leveraged for real-time applications.


More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform Page 1 of 16 Table of Contents Table of Contents... 2 Introduction... 3 NoSQL Databases... 3 CumuLogic NoSQL Database Service...

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information