Scalability and Design Patterns


1. Introduction
2. Dimensional Data Computation Pattern
3. Data Ingestion Pattern
4. DataTorrent Knobs for Scalability
5. Scalability Checklist
6. Conclusion

DataTorrent Inc.

1. Introduction

The DataTorrent platform enables massive scalability through its ability to distribute resource usage across a cluster. These resources include CPU, memory, I/O, buffers, etc. In this section we look at what scalability means and how to leverage the platform to achieve it.

The DataTorrent platform enables real time processing in the big data space, so its applications are big data applications by definition. An application is considered a big data application if the resources required to complete it on time exceed one server (node), i.e. the SLA cannot be met by the resources available in a single node. The need for multiple nodes may be due to any of CPU, memory, I/O, or hard disk. Most often a big data application needs more nodes for all of these resources.

In later chapters, we walk through a few design patterns to illustrate scalability. We illustrate various features of the DataTorrent platform that enable the design of a scalable application, and walk through the thought process for designing an application that takes advantage of the platform. The code shown is pseudo code.

2. Dimensional Data Computation Pattern

A dimensional data application consists of ingesting events, computing all (or most) of the dimensional combinations, and then computing actions based on the business logic. A dimensional data computation application has the following steps:

1. Events flow in from the front end or sensors, loosely termed event sources.
2. A single event is used to generate all relevant dimensional combinations (if not all).
3. Aggregation is done on these dimensional combinations as required by the model.
4. Models are applied to each of these dimensional combinations per application window.
5. The results are stored on a periodic basis, where the periods are integral multiples of the application window. The results may be stored in a database, appended to a file, or emitted to a message bus.

There are more variations of this pattern, but for the sake of illustrating scalable design we will use the specific pattern above. In this example we assume that events are pushed via a message bus. As we port generic dimensional data computation code to DataTorrent, we will make successive revisions to the design to enable scalability and make it a big data application. Let's start by writing this code outside the DataTorrent platform. For the sake of simplicity we assume that the incoming message can be converted into a string.

    // Pre DataTorrent application code
    void onmessage(message m) {
        String msg = m.getstring();
        dimensions = generatedimensions(msg);
        values = getvalues(msg);
        for (i = 0; i < dimensions.size(); i++) {
            updatemodels(compute(aggregate(dimensions.get(i))));
        }
        if (time_since_last_update >= application_window) {
            storeresults();  // to database, or files, or message bus
            flushmodels();   // flush internal cache
        }
    }

We see an immediate problem with the above code. If an event does not occur exactly at the application window boundary, the stored results will not be aligned with that boundary. Users would need to write multi threaded solutions to enforce an exact time boundary. Moreover this application is not scalable, and certainly not big data, as it is designed to run on a single compute node.

In revision 1 we use a very basic building block from the DataTorrent platform that relieves the application developer of the multi threading issue and enables the application to go beyond one node. The application consists of an input adapter that takes events from the onmessage() call and emits each as a String event. The input adapter scales by having multiple partitions, each listening to a channel of the message bus. The scalability of an input adapter is tightly aligned with the system that ingests the data (in this example the message bus). For this discussion we assume that the system is built to add partitions as needed.

We now dive deeper into the computations needed to process each event. Each event is processed as part of the process() call, and the computed results are emitted or stored as part of endwindow(). Since the compute operator has an application window, we use the Operator api to manage it. The DataTorrent platform enables users to write the same code irrespective of the value of the application window. The application window is an attribute that can be set at launch time via a configuration file. In future you will be able to set this during run time too.
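To make the process()/endwindow() contract described above concrete, here is a minimal sketch in plain Java. It is not the DataTorrent Operator api: the class name WindowedSumSketch and the driver in main() are hypothetical stand-ins for the platform thread that would call these methods, and a simple per-key sum is assumed as the aggregate.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a windowed compute operator (not the DataTorrent API).
final class WindowedSumSketch {
    private final Map<String, Long> totals = new HashMap<>();

    // Called once per event; a single caller thread is assumed, so no locking is needed.
    void process(String key, long value) {
        totals.merge(key, value, Long::sum);
    }

    // Called at the window boundary: hand back the results, then flush the state.
    Map<String, Long> endwindow() {
        Map<String, Long> results = new HashMap<>(totals);
        totals.clear();
        return results;
    }

    public static void main(String[] args) {
        WindowedSumSketch op = new WindowedSumSketch();
        // Window 1
        op.process("country=US", 2);
        op.process("country=US", 3);
        op.process("country=IN", 1);
        System.out.println("window 1: " + op.endwindow());
        // Window 2 starts from empty state
        op.process("country=US", 7);
        System.out.println("window 2: " + op.endwindow());
    }
}
```

The operator code never checks a clock. Whoever calls endwindow() decides how long a window is, which is why the same code can run unchanged for any application window value.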
In the coding example below, we have skipped the input adapter, as it is a standard operator that simply converts incoming messages into Strings and emits them to the next operator. Here is our first DataTorrent application.

    // First DataTorrent application (See Figure 1)
    public class SingleCompterOperatorApplication extends BaseKeyOperator<String>
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<String> data = new DefaultInputPort<String>(this);

        public void process(String tuple)
        {
            // Process each tuple, and add it to the models
            dimensions = generatedimensions(tuple);
            values = getvalues(tuple);
            for (i = 0; i < dimensions.size(); i++) {
                updatemodels(compute(aggregate(dimensions.get(i))));
            }
        }

        public void endwindow()
        {
            storeresults();
            flushmodels();
        }
    }

This design is much more modular, as the application period, the threads associated with it, etc. are taken care of by the platform. The platform guarantees single thread execution of process() and endwindow(), which ensures that the code works smoothly as is and that the application space is now a single thread execution. Further revision is now reduced to if, how, and where to split computations to enable scale. Do note that migration of some other application may need a complete redesign of the computational architecture. The platform also ensures that periodic updates and emits happen as needed by calling endwindow() at the end of the appropriate application window.

The call generatedimensions() can be used to generate time as a dimension and thus guarantee results precise to system clock boundaries. Users can also allow a lag time for events to catch up for a time window. For example, an application window of 5 minutes can allow 1 more minute for lagging events to catch up. For simplicity we assume all events arrive in the correct time bucket. This document focuses on the scalability aspect of the design, not on managing time series. For catch up of lagging events the application has to treat time as a dimension, which also translates to the platform's ability to scale.

Figure 1 shows our design revision as a DataTorrent application. We now have two operators, one for ingestion and one for all the computations of the application.

Now that our code has moved into the platform, we face our first set of scalability issues. The following need to be addressed:

1. What happens if my events grow and one node is not enough?
2. What happens if the number of dimensional combinations is too large and creates a resource bottleneck? This could be CPU, memory, or I/O.
   a. CPU is usually proportional to the ingestion rate and the compute model.
   b. Memory is proportional to the size of the state (keys) of the application, and therefore also dependent on the application window. Memory is also proportional to the algorithms used.
   c. I/O is proportional both to how the computations are segmented into containers and to the final result output.
3. What happens if the output adapter is not able to write results through one process? Can the write be made parallel?

There are many more aspects of scalability at the top level, and the internals of resource utilization are dramatically impacted by them. In this revision we address the issues listed above. We will address the other aspects of scalability in a later revision. The above three cases are the definition of big data: the resources needed to process the data exceed one server. Our next task is to figure out how a big data processing application scales.

To address scalability at ingestion (Issue 1), we need to partition the input adapter into I partitions. Each of these partitions reads a different channel or topic from the message bus. The input adapter thus scales via the classical load balancing approach, i.e. the number of partitions is proportional to the load. If the application needs more ingestion capacity, it adds more partitions. The platform allows dynamic addition of partitions, so ingestion resources can be added at run time. Each of these partitions emits strings.

As these strings are expanded into dimension combinations, i.e. a flat key structure, the number of events generated grows. More operators are needed to handle the generation of dimensional combinations and their aggregates. The affinity of this operator to the input adapter is very close, hence a parallel partition is ideal. The number of partitions is decided by two needs: first, being able to ingest all the incoming events, and second, being able to compute all the dimensional combinations and their aggregates. The larger of these numbers decides the number of partitions. The choice of threadlocal, processlocal, nodelocal, or racklocal on the stream is however decided by application needs. The choice trades event throughput against operator isolation; threadlocal has the highest event throughput and the lowest operator isolation. We will revisit Issue 2 in detail in a later revision.

Let's look at Issue 3 now, i.e. how do we scale the write or update of results from the application to an outside system. In the case of a database a bulk load will be used, as it aids atomicity, but it is advisable to do this in pieces (i.e. partitions). In the case of writing to a distributed file system (say HDFS), the outbound bandwidth needs multiple partitions to scale (by definition), and these will write to part files. In the case of a message bus, it makes sense to have multiple partitions emitting results, sometimes to different channels or topics. Let's assume the outbound operator has O partitions, where O is the optimal number for the outbound update.

Our design is now reflected in Figure 2. To fit the operator names in the figure, we give them short names:

op1: Input Adapter
op2: Dimensional Combination Generator
op3: Aggregation
op4: Compute Model
op5: Update External System

op1, op2, and op3 are connected using parallel partition, i.e. they share the same number of partitions, and data is unified after op3. A unifier is a built in feature that combines stream partitions into a single stream as per the operator logic. Unifiers can be used for NxM partitions. The number of unifiers will be M, as we have M physical streams, one for each partition of the downstream operator (in this case op4).
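The role of a unifier after load balanced partitions can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the platform's Unifier interface: the class UnifierSketch and its methods are hypothetical, and a per-key count is assumed as the aggregate computed by each partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: partial aggregates from N partitions merged by a unifier step.
final class UnifierSketch {
    // Each partition counts whatever keys the load balancer happened to hand it.
    static Map<String, Long> aggregatePartition(List<String> keys) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : keys) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // The unifier combines the partial counts into one result per key.
    static Map<String, Long> unify(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> merged.merge(key, count, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        // The same key may land on any partition; no sticky key routing is required.
        partials.add(aggregatePartition(Arrays.asList("a", "b", "a")));
        partials.add(aggregatePartition(Arrays.asList("a", "c")));
        System.out.println(unify(partials));
    }
}
```

Because the merge is associative, events for one key can be spread across any number of partitions and still produce the same final aggregate.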

The next table shows how the above application looks in code. The input adapter and the parallel partition are DAG properties and are not shown in the table. The table consists of a breakdown of the basic code pattern. It is assumed that stitching these operators into a DAG and assigning attributes to the application are done separately.

    // A proper partitioned DataTorrent application (See Figure 2)

    // op2 is Dimension Combination Generator
    public class op2 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<String> data = new DefaultInputPort<String>(this);

        public void process(String tuple)
        {
            // Process each tuple, and emit one pair per dimensional combination
            dimensions = generatedimensions(tuple);
            values = getvalues(tuple);
            for (i = 0; i < dimensions.count(); i++) {
                emit(KeyValPair(dimensions(i), values(i)));
            }
        }
    }

    // op3 is Aggregator
    public class op3 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<KeyValPair> data = new DefaultInputPort<KeyValPair>(this);

        public void process(KeyValPair tuple)
        {
            // Look up in a hash, and aggregate. The code may add, find max, min,
            // average, etc.
            aggregate(tuple);
        }

        public void endwindow()
        {
            // emit aggregate for each key
            emitaggregates();
        }
    }

    // op4 is the computemodel, i.e. the basic business logic
    // it works on aggregates done in an application window
    public class op4 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "data")
        public final transient DefaultInputPort<dimension_value> data = new DefaultInputPort<dimension_value>(this);

        public void process(dimension_value dimension_aggregate)
        {
        }

        public void endwindow()
        {
            emitresults();
            flushmodels();
        }
    }

    // op5 outputs data to an external system
    public class op5 extends BaseOperator
    {
        @InputPortFieldAnnotation(name = "results")
        public final transient DefaultInputPort<result_schema> result = new DefaultInputPort<result_schema>(this);

        public void process(result_schema ret)
        {
        }

        public void endwindow()
        {
            updateexternalsystem();  // write the results received in this window
            flushstate();
        }
    }

Revision 2 has made strong progress towards scalability. We now see immediate scalability bottlenecks that may show up from the following computations:

- The number of key combinations for N dimensions is 2**N. This grows exponentially, and directly causes the stream between the Dimension Generator and the Compute Model to become I/O bound and/or CPU bound. The fan in to the compute model (op4) will grow and create a bottleneck.
- The compute model is usually proportional to state (number of keys) and the kind of computations needed.

This means that the following two problems need to be solved:

- The ingestion is load balanced, but the compute model operator(s) are balanced by state and computations, which most likely are sticky key balanced.
- The number of partitions in the operator that updates the external system has affinity to, and is constrained by, the external system.

The above two issues are solved by ensuring that the compute model has N partitions, and between each of the phases
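The 2**N growth in key combinations mentioned above can be seen in a small sketch. This is only an illustration: the class DimensionCombinations, the key format name=value joined by '|', and the sample dimensions are all assumptions, not the document's generatedimensions().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of expanding one event into all dimensional combinations.
final class DimensionCombinations {
    // Returns one flat key per subset of the dimensions: 2^N keys for N dimensions.
    // The empty key stands for the "all events" aggregate. Assumes N is small (< 31).
    static List<String> combinations(String[] names, String[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        int n = names.length;
        List<String> keys = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {      // dimension i is part of this combination
                    if (key.length() > 0) {
                        key.append('|');
                    }
                    key.append(names[i]).append('=').append(values[i]);
                }
            }
            keys.add(key.toString());
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] names = {"country", "device", "os"};
        String[] values = {"US", "mobile", "android"};
        List<String> keys = combinations(names, values);
        System.out.println(keys.size() + " keys for " + names.length + " dimensions");
        keys.forEach(System.out::println);
    }
}
```

Each additional dimension doubles the number of keys emitted per event, which is why the stream after the dimension generator is the first place to look for I/O and CPU pressure.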

We now have a generic blueprint of the application's logical structure that is amenable to scaling. For a scalable application to work within the platform, the application developer needs to take care of the following:

- The platform cannot break apart a for or a while loop in your code. The breakdown of computations into individual operators has to be done by the application developer. Each leaf level operator is a single compute unit, and a good design needs the application to be developed with optimally separated compute units.
- Migration of a single node application is not just a matter of migrating the code. It is often a redesign, as the assumptions made in a single node design often do not hold true.
- The platform checkpoints each physical operator in its entirety, so a physical operator with a huge state will block processing until the checkpointing is done. Partitioning based on state size enables scale. Dynamic scalability also involves loading checkpoints, so checkpoint state impacts both scale and fault tolerance.
- The design should have minimal bottlenecks, which means it should have a logical structure where partitioning reduces the resource needs per partition in proportion to the number of partitions. The platform is designed to enable this outcome for a vast majority of design patterns. The platform is able to distribute all resources (CPU, memory, I/O) and operations (computations, checkpointing, event/data passing) in a scalable manner. We will strive to add more such features as needed.
- The application should be designed to scale linearly. The platform has a very linear scalable model that handles hundreds of millions of tuples. A common example of possibly uneven distribution is key based partitioning, which is susceptible to skew. The platform does a good job of handling skew, including the ability to load balance and unify a large skewed partition. More features will be added to make skew management as smooth as possible in the future. However, designs that are easily susceptible to key skew should be solved by designing to manage skew. Unifiers are a great way to migrate a key based partition to a load balanced partition.

Here is what the platform can do:

- Massively scale almost linearly, as the bookkeeping is very light. The bookkeeper does not interfere with data flow
- Distribute resources and operations across a commodity cluster. This includes CPU, memory, I/O, as well as checkpointing
- Dynamically scale by adding or removing partitions
- Dynamically enable adding or modifying functionality
- Measure system parameters in real time and take action
- Separate operability from business logic. Enable users to focus on business logic. Enable users to design the application such that scalability is completely an operational issue

We now have a building block application that can be leveraged to put together a worldwide real time platform. At the top level this is a deployment structure for an enterprise with a worldwide customer base. The number of customer data centers or clusters supported can be in the hundreds or thousands. The reference architecture below scales into a massive worldwide footprint.


3. Data Ingestion Pattern

Data ingestion is a common pattern in the Hadoop ecosystem. A big part of the Hadoop job workload consists of running mapreduce applications on files copied from outside of Hadoop (usually the front end), with the results copied back out for consumption by other mapreduce jobs. This ingestion causes big delays, both in the copy itself and in waiting until all files are copied. The DataTorrent platform can help significantly with this pattern. It aims to achieve the following:

- Data can be ingested as soon as it is generated. This removes the latency of file creation and copy
- Jobs can start as soon as possible and process events as they come in
- Data can be filtered, cleansed, and stored in real time for future use

In chapter 2 we briefly covered scaling the input adapter. In this chapter we analyse in more depth ways to dynamically scale input adapters. In a data ingestion application, it is very common to use a message bus. In discussing the data ingestion pattern, we will not use pseudo code. Figure 5 shows how a preliminary design looks.

We see immediate issues with the above design. There is a single input adapter, and the HDFS adapter is not partitioned and hence is limited to one HDFS write. Leveraging the thought process from Chapter 2, we can make the following improvements in revision 2:

- Add partitions to the input adapter
- Connect the input adapter and the filter operator via parallel partition
- Add partitions to the HDFS adapter
- Collate data to decide which data is streamed to which HDFS adapter, i.e. partition the HDFS adapter via sticky key. This key could be time, server name, etc.
- Partition the collation as needed

The above changes enable a much more scalable design, as seen in Figure 6.

Figure 6 is a good design, but it lacks dynamic scalability on the input adapter operator. There is one filter operator for each input adapter, connected via parallel partition, so its scaling is handled by the input adapter. The collate operator will dynamically scale as needed. The HDFS adapter also needs special care: if one partition is not able to handle the write throughput of the key(s) assigned to it, the adapter will need to partition that set of key(s), or write to part files if it is a single key. A generic partition will not work, as the files written to HDFS need a proper naming scheme. Unlike the dimensional computation model, the storage operation is bottlenecked by throughput, as every incoming event eventually generates one write to HDFS.
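One way to picture the naming scheme requirement for part files is the sketch below. The layout base directory / sticky key / part-partition-sequence is purely an assumption chosen for illustration, as are the class PartFileNames and its parameters; the source does not specify a scheme.

```java
import java.util.Locale;

// Hypothetical naming scheme: <baseDir>/<stickyKey>/part-<partitionId>-<sequence>
final class PartFileNames {
    // Builds a path that is unique per (key, partition, sequence), so several
    // partitions can write the same key without colliding on one file.
    static String partFile(String baseDir, String stickyKey, int partitionId, long sequence) {
        return String.format(Locale.ROOT, "%s/%s/part-%03d-%06d",
                baseDir, stickyKey, partitionId, sequence);
    }

    public static void main(String[] args) {
        // Two partitions sharing the load of a single hot key (here a date bucket).
        System.out.println(partFile("/data/ingest", "2015-01-04", 0, 17));
        System.out.println(partFile("/data/ingest", "2015-01-04", 1, 17));
    }
}
```

Deriving the name from the partition id means that adding partitions for a hot key changes only how many part files exist, not where a reader has to look for them.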

For dynamic addition of more input adapters, the message bus and its controller need to enable the following at run time:

- Add more channels, topics, etc.
- Access the DataTorrent application to create more partitions
- Register each partition with a channel, or use other methods for that partition to receive data
- Save state outside the application to enable stateful recovery. We will use ZooKeeper in this example
  - State of the current properties of the input adapter partition
  - State of data ingestion

The HDFS adapter needs file name logic built in to handle more load. Data will be written in part files if need be. The DataTorrent platform can be leveraged by using partitions to determine the part file names.

Figure 7 shows the final design. An application that ingests big data needs a lot of architectural planning if scalability is to be attained.

We can further improve on Figure 7, as the Ingestion Controller is a single point of failure. We can move it into Hadoop as a DataTorrent application.

4. DataTorrent Knobs for Scalability

The DataTorrent platform has built in innovations geared for scalability at big data scale. The following is a brief list of such features:

- Operators are single thread executions and pay negligible context switching cost. The model ensures that code is clean and not cluttered with thread management.
- Clean separation of business logic and operability. The architecture is designed to enable massive scale in a linear fashion, since the operability logic is completely within the platform.
- Thread contention between operators is negligible due to inbuilt platform features. All event buffers are handled by the platform, and the operator api is executed in a single thread.
- All streams are single write and multiple read configurations. This enables scale, as there is no write contention to manage.
- A lot of checks and balances are built in at compile time and launch time to enable the runtime to be scalable. These include schema checks at compile time, thereby removing any schema checks at run time.
- Operations are designed to be executed in as unblocked a way as possible. Upstream operators continue to process events even if downstream operators are down. All containers and their components continue to function as is unless their upstream operators are down.
- Execution is distributed across the cluster and leverages resources as needed. The scale out is the same as map reduce: operators, streams, and buffers are partitioned as needed and launched on containers across the grid.
  - All resources: CPU, Memory, Network I/O, Buffers, Checkpoints
  - Ability to scale linearly as much as possible by design
- Stream Locality (Thread local, Process local, Node local, Rack local): This feature enables users to align operators (i.e. function calls) in the most optimal way. ThreadLocal uses the call stack to pass events, so the event needs no copy or buffering. Process Local has a buffer in between, but the tuple does not need serialization and thus scales better. Node Local uses local loopback to save on network interface card resources.
- Bufferserver: The event stream between writer and reader(s) is managed by the bufferserver. By decoupling the write from the read, the write operator can continue to process events without any blocking by the read operator(s). The buffer server also removes the dependency between two read operations. This separation is critical to scaling the application.
- NxM partitions: Supports NxM partitions and thereby allows each operator to partition as per its needs.
- Parallel Partitions: Enable a series of operators to be partitioned in unison. This allows a set of operators to leverage various stream localities, as well as do away with NxM partitions in between. This partition scheme simplifies the application significantly.
- Unifiers: They allow users to use load balancing partitions even when the computation is key based. This is a major gain, as skew is removed by avoiding sticky key partitions altogether. The unifier runs in the upstream container and thus provides a clean interface for the downstream operator.
- Cascading Unifiers: Very useful when a unifier has a huge fan in. A cascading unifier puts a limit on the fan in and enables the application to scale.
- Partitionable Unifiers: Extremely useful when the unifier needs to integrate state from all the partitions. The unique count operator is an example of this. A partitionable unifier allows streams to be partitioned by the values of the key, thereby allowing scalable operators for functions such as unique counts.
- Ability to insert split and merge operators, as a massive fan out of read operations can be cascaded via a tree structure. At each level the fan out will be controlled. Single write streams also ensure that
- Runtime skew balancing: Skew in sticky key partitions can be managed at run time. This enables applications to adapt to runtime skew. For example, if an internet site has a USA audience during the daytime and a China audience during the night, the skew in processing will be balanced in both cases. Scale is thus achieved by removing skew as a bottleneck.
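The fan in limit that a cascading unifier imposes can be sketched as a tree of merge steps. This plain Java illustration assumes partial sums as the values being unified; the class CascadingUnifierSketch and its methods are hypothetical and do not reflect how the platform deploys unifiers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: merge partial results in a tree so that no single
// merge step reads from more than fanIn upstream partitions.
final class CascadingUnifierSketch {
    // Number of unifier levels needed to reduce `partitions` streams to one.
    static int levels(int partitions, int fanIn) {
        if (partitions < 1 || fanIn < 2) {
            throw new IllegalArgumentException("need partitions >= 1 and fanIn >= 2");
        }
        int levels = 0;
        while (partitions > 1) {
            partitions = (partitions + fanIn - 1) / fanIn;  // ceiling division
            levels++;
        }
        return levels;
    }

    // Merges partial sums level by level, at most fanIn inputs per merge.
    static long unify(List<Long> partials, int fanIn) {
        if (fanIn < 2) {
            throw new IllegalArgumentException("fanIn must be at least 2");
        }
        List<Long> current = new ArrayList<>(partials);
        while (current.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += fanIn) {
                long sum = 0;
                for (long value : current.subList(i, Math.min(i + fanIn, current.size()))) {
                    sum += value;
                }
                next.add(sum);
            }
            current = next;
        }
        return current.isEmpty() ? 0L : current.get(0);
    }

    public static void main(String[] args) {
        System.out.println("levels for 100 partitions, fan in 4: " + levels(100, 4));
        System.out.println("sum: " + unify(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L), 3));
    }
}
```

The number of levels grows only logarithmically with the partition count, so the extra hops stay small even when the upstream operator has many partitions.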

5. Scalability Checklist

An application developer needs to adapt their thought process and assumptions to ensure the scalability of a big data application. A list of possible questions from the scalability point of view follows.

Questions for every operator

- If the computation in the operator requires more CPU, will the operator partition the CPU requirement linearly?
- If the inbound or outbound I/O grows, will the operator partition I/O linearly?
- If the state of the operator grows, will the operator be able to partition state linearly?
- Are the computations being done by the most optimal leaf level computations? Does the operator need to be split (or merged)?
- Is the operator able to leverage all the needed knobs of the platform?
- Will the operator scale linearly if an application developer sets an arbitrarily large window?
- Is the state of the operator optimized for scale? Does it block stream processing for too long?
- Is there a bottleneck in the operator computation? Can another design unblock it?

Questions for stream design

- Does the stream have optimal locality?
- Are the fan out and fan in features being leveraged correctly?
- Should this stream exist? Would it be better to merge the operators?
- Is the schema of the tuple correct? Will it enable or impede scale?

Questions for application development

- Are there assumptions that were true at smaller scale that need to be discarded when designing a big data application?
- Assuming the application has access to infinite resources, what would the ideal design be?
- Are computations being done by the most optimal leaf level computations? Should the application be redesigned?
- Where are the architectural bottlenecks if throughput, computations, etc. increase?
- Is the application bottlenecked if an outbound connection is slow?
- Is there resource contention? For example, are all partitions accessing the same database?
- Is the design optimally leveraging the resource boundaries: container size, cores per container, network I/O? For example, is one container using all the memory while another container is using minimal memory?

Common scalable patterns in customer use cases

1. CPU partitioning can be scaled if the compute operation can be broken down into computations that can be done on partial data and then aggregated. Common examples include word count and filtering during ingestion. Word count leverages sticky key partitioning to ensure that each partition counts the occurrences of a subset of keys. Aggregations can leverage load balancing partitions to distribute computations, and use a unifier for the final aggregation.
2. Inbound I/O of an application can be scaled linearly if the computations done as part of ingestion can be partitioned. Parallel partitions are usually very useful in such use cases. Example includes
3. Outbound I/O of an application scales by distribution if the external system can ingest all the I/O. For example, if an output adapter writes to MongoDB, then the MongoDB instance should be able to handle writes from all the partitions.
4. Internal I/O scales if the downstream operator can be partitioned.
5. If the state of the operator grows with data, and the logical operator needs access to the entire state, it will create a bottleneck for scalability. Such an application will need a redesign. Unifiers provide a very powerful way to combine load balancing (skew-less partitioning) with access to aggregates at the end of the window. An example is tracking 2nd level connections in a connected graph. If the design has all the connections in the same operator, then the work of getting the first level connections and then their next connections grows with the size of the graph. A scalable design with more optimal leaf level operators is as follows:
   a. Operator1: Emits all the 1st level connections, as id:1. This computation scales with sticky key load balancing.
   b. Operator2: Emits all next level connections of these, as id:2. This operator also scales with sticky key load balancing.
   c. Operator3: Receives the outputs of operator1 and operator2, and drops all id:2 if it received id:1 for that id. Emits the final result at the end of the window. This operator also scales with sticky key load balancing.
6. If two adjacent operators in an application have massive throughput, leveraging Thread_Local or Process_Local should be considered. A very common example is generating dimensional keys and aggregating them. This use case can leverage stream locality along with parallel partition.
7. A possible bottleneck when processing millions of events/sec is that the application may need a lot of partitions. If the partition is Nx1, then all these partitions have to feed into a single downstream physical operator, and the fan in may create a bottleneck. A cascading unifier solves this problem by unifying with a fixed fan in at each level.
8. It is common to have the results of a computation as a HashMap, say HashMap<String, Integer>. In this example a single tuple with this schema is emitted at the end of the window. As the data grows, so does the tuple. However, because it is a single tuple, the downstream operator has to consume it in a single physical operator. One way out is to partition the current operator, with each physical partition sending out a HashMap. This pattern has problems with scalability, as the operator will need to ensure that a key only exists in one of the HashMaps. This forces the operator to take in the input stream with sticky key partitioning. A more scalable way is to get the operator to emit KeyValPair<String, Integer>. The atomicity of the tuples is now moved to a KeyValPair. The downstream operator can now partition and consume the same data as per load, and has no upper limit on the number of partitions. This problem can only be solved by changing the schema. Even if we put in an operator that consumes the HashMap and immediately splits the tuple into KeyValPairs, this split operator will still need to take in one single large HashMap tuple.
9. Checkpointing is done by serializing the operator object to disk at the end of the window. While checkpointing is being done, stream processing is paused. The time taken to serialize thus reduces the number of events that can be processed. It is advisable to reduce the size of the object by partitioning based on the object size, by using stateless operators as much as possible, or by declaring variables that are not part of the state as transient. For example, if an operator aggregates and emits the result at the end of every window, the operator is in effect stateless. Such an operator should ensure that all its variables are transient.
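The effect of declaring window scoped state as transient can be illustrated with standard Java serialization. This is a sketch under an assumption: java.io.ObjectOutputStream is used here only as a stand-in, and the platform's actual checkpoint serializer may differ. The class and field names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compare the serialized size of an operator whose
// aggregates are part of its state with one that marks them transient.
final class CheckpointSizeSketch {
    static class StatefulAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, Long> totals = new HashMap<>();   // serialized with the operator
    }

    static class WindowScopedAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        // Flushed at the end of every window, so it is excluded from serialization.
        // A transient field is not restored on deserialization and must be re-created.
        transient Map<String, Long> totals = new HashMap<>();
    }

    // Serializes the object in memory and reports the number of bytes written.
    static int serializedSize(Serializable operator) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(operator);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        StatefulAggregator stateful = new StatefulAggregator();
        WindowScopedAggregator windowScoped = new WindowScopedAggregator();
        for (int i = 0; i < 10000; i++) {
            stateful.totals.put("key" + i, (long) i);
            windowScoped.totals.put("key" + i, (long) i);
        }
        System.out.println("stateful bytes:      " + serializedSize(stateful));
        System.out.println("window scoped bytes: " + serializedSize(windowScoped));
    }
}
```

The window scoped variant serializes only its class descriptor regardless of how many keys it holds, which is the property that keeps the checkpoint pause short.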

6. Conclusion
DataTorrent platform is designed to enable applications to scale and process big data in real time. The platform follows the fundamental tenet of achieving scalability by distributing resource utilization across a commodity cluster. However, scalability cannot be slapped on to a design; it has to be baked in. The design of an application must be scalable by construction. The platform has various knobs that users can avail of, and to do so the applications must be designed accordingly. This ability is one of the most crucial areas of expertise needed to design scalable DataTorrent applications. While porting current applications, care should be taken to remove assumptions baked into a design that would block the application from taking advantage of the features in the platform. Designs carry within them constructs that derive from basic assumptions.
DataTorrent is a big data platform that enables massively scalable real time computations in near linear fashion. This requires a new thought process and the questioning of old assumptions that are no longer true. Big data scalability is a new paradigm. An application developer as well as an operator developer must ask, "What can I do if I have massive compute resources at my disposal? How will my design look?" Application developers also need to understand how an application works when numerous real time tasks are running on various nodes throughout the cluster: how the data flows, how windows sync up, and how the whole application runs in parallel. DataTorrent aims to enable an ecosystem that derives from the commoditization of compute resources leveraged for real time applications.


More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Cloud Management: Knowing is Half The Battle

Cloud Management: Knowing is Half The Battle Cloud Management: Knowing is Half The Battle Raouf BOUTABA David R. Cheriton School of Computer Science University of Waterloo Joint work with Qi Zhang, Faten Zhani (University of Waterloo) and Joseph

More information

A survey of big data architectures for handling massive data

A survey of big data architectures for handling massive data CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - jordydomingos@gmail.com Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context

More information

The Software Defined Hybrid Packet Optical Datacenter Network SDN AT LIGHT SPEED TM. 2012-13 CALIENT Technologies www.calient.

The Software Defined Hybrid Packet Optical Datacenter Network SDN AT LIGHT SPEED TM. 2012-13 CALIENT Technologies www.calient. The Software Defined Hybrid Packet Optical Datacenter Network SDN AT LIGHT SPEED TM 2012-13 CALIENT Technologies www.calient.net 1 INTRODUCTION In datacenter networks, video, mobile data, and big data

More information

Core and Pod Data Center Design

Core and Pod Data Center Design Overview The Core and Pod data center design used by most hyperscale data centers is a dramatically more modern approach than traditional data center network design, and is starting to be understood by

More information

Performance Testing of Big Data Applications

Performance Testing of Big Data Applications Paper submitted for STC 2013 Performance Testing of Big Data Applications Author: Mustafa Batterywala: Performance Architect Impetus Technologies mbatterywala@impetus.co.in Shirish Bhale: Director of Engineering

More information

BlobSeer: Towards efficient data storage management on large-scale, distributed systems

BlobSeer: Towards efficient data storage management on large-scale, distributed systems : Towards efficient data storage management on large-scale, distributed systems Bogdan Nicolae University of Rennes 1, France KerData Team, INRIA Rennes Bretagne-Atlantique PhD Advisors: Gabriel Antoniu

More information