Large-Scale Image Classification using High Performance Clustering

Size: px
Start display at page:

Download "Large-Scale Image Classification using High Performance Clustering"

Transcription

1 Large-Scale Image Classification using High Performance Clustering Bingjing Zhang, Judy Qiu, Stefan Lee, David Crandall Department of Computer Science and Informatics Indiana University, Bloomington {zhangbj, xqiu, steflee, Abstract Computer vision is being revolutionized by the incredible volume of visual data available on the Internet. A key part of computer vision, data mining, or any Big Data problem is the analytics that transforms this raw data into understanding, and many of these data mining approaches require iterative computations at an unprecedented scale. Often an individual iteration can be specified as a MapReduce computation, leading to the iterative MapReduce programming model for efficient execution of data-intensive iterative computations. We propose the Map-Collective model as a generalization of our earlier Twister system that is interoperable between HPC and cloud environments. In this paper, we study the problem of large-scale clustering, applying it to cluster large collections of million social images, each represented as a point in a high dimensional (up to 2048) vector space, into 1-10 million clusters. This K-means application needs 5 stages in each iteration: Broadcast, Map, Shuffle, Reduce and Combine, and this paper presents new collective communication approaches optimized for large data transfers. Furthermore one needs additional communication patterns from those familiar in MapReduce, and we develop collectives that integrate capabilities developed by the MPI and MapReduce communities. We demonstrate that a topologyaware and pipeline-based broadcasting method gives better performance than both MPI and other (Iterative) MapReduce systems. We present early results of an end-to-end computer vision application and evaluate the quality of the resulting image classifications, showing that increasing the number of feature clusters leads to improved classifier accuracy. Keywords - Social Images; Data Intensive; High Dimension; Iterative MapReduce; Collective Communication I. INTRODUCTION The rate of data generation now exceeds the growth of computational power predicted by Moore s Law. Mining and analyzing these massive data sources to translate large-scale data into knowledge-based innovation represents a major computational challenge. MapReduce frameworks have become popular in recent years for their scalability and fault tolerance in large data processing and for their simplicity in the programming interface. Hadoop [1], an open source implementation following Google s original MapReduce concept [2], has been widely used in industry and academia. But MapReduce does not directly address iterative solvers and basic matrix primitives, which Intel s RMS (Recognition, Mining and Synthesis) taxonomy [3] identifies as common computing kernels for computer vision, rendering, physical simulation, financial analysis and data mining. These and other observations [7] suggest that iterative data processing will be important to a spectrum of e- Science and e-research applications as a kernel framework for large-scale data processing. Several frameworks designed for iterative MapReduce have been proposed to solve this problem, including Twister [4], Spark [5], HaLoop [6], etc. For example, the initial version of Twister optimized data flow and reduced data transfer between iterations by caching invariant data in the local memory of compute nodes. However, it did not support the communication patterns needed in many applications. We observe that a systematic approach to collective communication is essential in many iterative algorithms. Thus we generalize the (iterative) MapReduce concept to Map-Collective, since large collectives are distinctive features of data intensive applications [8][9]. Large-scale computer vision is one application that involves big data and often needs iterative solvers, such as large-scale clustering stages. This application produces challenges requiring both new algorithms and optimizing parallel execution involving very large collective communication steps. We have addressed overall performance with an extension of Elkan's algorithm [10] to drastically speed up the computing (Map) step of clustering by use of the triangle inequality to remove unnecessary computation [8]. But this improvement increased the need for efficient communication, which is a major focus of this paper. Note that communication has been well studied, especially in MPI, but large-scale computer vision stresses different usage modes and message sizes from most previous applications. In this paper, we study the characteristics of a large-scale image clustering application and identify performance issues of collective communication. Our work is presented in the context of Twister, but the analysis is applicable to MapReduce and other data-centric computation solutions. In particular, the vision application requires 7 million image feature vectors to be clustered. We execute the application on 1000 cores (125 8-core nodes) with 10,000 Map tasks. The root node must broadcast 512 MB of data to all compute nodes, making sequential broadcast costly. In aggregation, 20 TB of intermediate data needs to be transferred from Map stage. Here we propose a topology-aware pipeline method to accelerate broadcast by a factor of over 120 compared with the sequential algorithm, and comparable with classic MPI methods [11] by slight (20%) outperformance in C and large factors over Java MPJ. We also use 3-stage aggregation to efficiently reduce intermediate data size by at least 90% within each stage. These methods provide important collective communication capabilities to our new iterative Map-Collective framework for data intensive applications. We evaluate our new methods on the PolarGrid cluster.

2 II. Figure 1. Workflow of the vocabulary learning application. MOTIVATING APPLICATION: LARGE-SCALE VISION Computer vision is being revolutionized by the volume of visual data on the Internet, including 500 million images uploaded every day to Facebook, Instagram and Snapchat, and 100 hours of video uploaded to YouTube every minute. These huge collections of social imagery are motivating many large-scale computer vision studies that can benefit from the infrastructure studied here. A major goal of these projects is to help organize photo collections; for instance, by automatically determining the type of scene [12], recognizing common objects [13] and landmarks [14], determining where on earth a photo was taken [16] [17], and so on. A. Scene Type Recognition Here we consider the specific problem of recognizing two general properties of a photo: whether it was taken in an urban or rural environment, and whether it was taken in a mountainous or relatively flat locale. A popular approach for visual classification tasks involves embedding images in a more discriminative space by representing them as collections of discriminative, invariant local image features. This is known as the Bag of Words (BoW) model, and is borrowed from techniques in information retrieval that represent documents as an unordered collection of words. To make this analogy work, we need to identify basic distinguishing features or visual words so that images can be encoded as histograms over this vocabulary [17]. One way of producing this vocabulary is to sample small patches from many images, perform clustering, and then use each of the cluster centroids to define a visual word. In our application, we sample five patches from each of 12 million images and describe each patch using the Histograms of Oriented Gradients (HOG) features [18]. HOG characterizes the local image features as a distribution of local edge gradients (see Figure 1), and produces a 512-dimensional vector for each patch. Once the vocabulary is generated, any image can be represented as a histogram over this vocabulary: given an image, we densely sample patches, compute HOG vectors from these patches, assign each vector Figure 2. Image clustering control flow in Twister with the new local aggregation feature in Map stage. to its nearest centroid in the vocabulary, and then count the number of times each centroid occurs. Support Vector Machines (SVM) [19] are trained on these features to learn a discriminative classifier between the image classes of interest (e.g. rural vs. urban). B. Large-scale Image Feature Clustering A major challenge with this approach is the clustering step, as ideally we would like to cluster billions of patches into millions of centroids. We confront this computational challenge by performing K-means clustering as a chain of Map computations separated by collective communications (Figure 2). The input data consists of a large number of feature vectors each with 512 dimensions. We compute Euclidean distances between feature vectors and cluster centers (centroids). Since the vectors are static over iterations, we partition (decompose) the vectors and cache each partition in memory, and assign a Map task to it during job configuration. At each iteration, the iteration driver broadcasts centroids to all Map tasks. Each Map task then assigns feature vectors to the nearest cluster. Map tasks calculate the sum of vectors associated with each cluster and count the total number of such vectors. To simplify the discussion, we only describe 1-stage aggregation here instead of 3-stage aggregation which is used in real implementation. The aggregation stage processes the output collected from each Map task and calculates new cluster centers by adding all partial sums of cluster center values together, then dividing by the total number of points in the cluster. By combining these new centroids from Reduce tasks, the iteration driver gets updated centroids and control flow enters the next iteration (Table I). Another major challenge of this application is the amount of feature data. Currently we have nearly 1 TB, corresponding to 60 million HOG feature vectors, and we expect problems to grow in size by one to two orders of magnitude. For such a large amount of input data, we can increase the number of machines to reduce the data size per node, but the total data size (of cluster centers) transferred in broadcasting and aggregation still grows as the number of centers multiplies.

3 TABLE I. ALGORITHMS AND IMPLEMENTATION OF THE IMAGE CLUSTERING APPLICATION (ONE AGGREGATION TASK ONLY) Algorithm 1 Iteration Driver numloop maximum iterations centroids[0] initial centroids value for(i 0; i < numloop; i i+1) broadcast(centroids[i]) runmapreduceiteration() centroids[i+1] getresults() Algorithm 2 Map Task vectors load and cached from files centroids load from memory cache mindis new int[numvectors] mincentroidindex new int[numvectors] for (i 0; i < numvectors; i i+1) for (j 0; j < numcentroids; j j+1) dis geteuclidean(vectors[i], centroids[j]) if (dis < mindis[i]) mindis[i] dis mincentroidindex[i] j localsum new int[numcentroids][512] localcount new int[numcentroids] for(i 0; i < numvectors; i i+1) localsum[mincentroidindex[i]] + vectors[i] localcount[mincentroidindex[i]] + 1 collect(localsum, localcount) Algorithm 3 Aggregation Task localsums collected from Map tasks localcounts collected from Map tasks totalsum new int[numcentroids][512] totalcount new int[numcentroids] newcentroids new byte[numcentroids][512] for (i 0; i < numlocalsums; i i+1) for (j 0; j < numcentroids; j j+1) totalsum[j] = totalsum[j] + localsums.get(i)[j] totalcount[j] = totalcount[j] + localcounts.get(i)[j] for (i 0; i < numcentroids; i i+1) newcentroids[i] = totalsum[i]/ totalcount[i] collect(newcentroids) For example, suppose we were to cluster 7 million 512- dimensional vectors into 1 million clusters. In one iteration, the execution is done on 1,000 cores in 10 rounds with a total of 10,000 Map tasks. Each task only needs to cache 700 vectors (358KB) and each node needs to cache 56K vectors, about 30MB in total. But in broadcast, the size of 1 million cluster centers is about 512MB. Therefore the centroids data per task received through broadcasting is much larger than the image feature vectors per task. Since each Map task needs a full copy of the centroids, the total data sent through broadcasting grows with the problem size and the number of nodes. For the example above, the total broadcast is about 64 GB (because Map tasks are executed as threads, broadcast data can be shared among tasks on one node). We now reach the aggregation stage. Here each Map task generates about 2 GB of intermediate data, for a total of about 20 TB. This far exceeds the total memory size of 125 nodes (each of which has 16 GB memory; 2 TB in total), and also makes the computation difficult to scale, since the data size grows with the number of nodes. In this paper, we do 3- stage aggregation to solve this problem. In the first stage, we reduce 20 TB of intermediate data to 250 GB with local aggregation (Figure 2). But due to limited memory, 250 GB still cannot be aggregated directly on one node. Thus we further divide the output data from each Map task into 125 partitions (numbered with Partition ID 0 to 124) and use 125 tasks (1 task per node) to do group-by aggregation at the second stage. In this way, each node only processes 2 GB of data: Node 0 processes Partition 0 from all Map tasks, Node 1 processes Partition 1 from all Map tasks, and so on. The output of group-by aggregation on each node is about 4 MB, so the 125 nodes only need to gather about 512 MB to the driver in the third stage of aggregation. In Table II we give the time complexity of each part of the algorithm, with as the number of nodes, as the number of Map tasks, as the number of centroids, as the total number of feature vectors, and as the number of dimensions. The improved aggregation time complexity is expressed as the sum of time complexity for 3 aggregation stages. We approximate the improved Map task running time using triangle inequalities from [8]. III. TABLE II. TIME COMPLEXITY OF EACH STAGE Stage Simple Improved Broadcasting Map [8] Aggregation COLLECTIVE COMMUNICATION IN PARALLEL PROCESSING FRAMEWORKS In this section, we compare several big data parallel processing tools (MPI, Hadoop MapReduce and iterative computation tools such as Twister and Spark [5]) and show how they are applied. We analyze collective communication patterns and how intermediate data is handled in each tool. We expect the ideas of these tools to eventually converge into a single environment, which our new optimal communication is aimed for in order to serve big data applications. A. Runtime Models MPI, Hadoop, Twister and Spark aim at different types of applications and data, and have very different runtime models. We classify parallel data processing and communication patterns [20] in Figure 3. On the data tool spectrum, Hadoop and MPI are at opposite ends while Twister, Spark and other MapReduce-like tools are in the middle with mixed features extended from both Hadoop and MPI. Here we propose using systematic support of collectives to unify these models. 1) MPI is a computation-centric solution that serves scientific applications that are compute intensive and have complicated communication patterns. It can spawn parallel processes to compute nodes, although users need to define the computation in each process and handle communication between them. MPI is highly optimized in communication performance, offering basic point-to-point and also collective communication operations. MPI runs on HPC and supercomputers where data is decoupled from computation and stored in a shared and distributed file system. MPI does not have unified data abstractions analogous to the keyvalue pairs in MapReduce; it is flexible enough to process different types of data. MPI lacks fixed control flow,

4 endowing it with the flexibility to emulate MapReduce or other user-defined models [21-23]. 2) MapReduce and Hadoop: On the other hand, Hadoop is data-centric. HDFS [25] is used to store and manage big data so that users are freed from the data accessing and loading steps required in MPI. In addition, computations are performed where the data is located to promote scalability. Key-Value pairs are the core data abstraction in MapReduce. With keys, intermediate data values are labeled and regrouped automatically without explicit communication commands. Hadoop is very suitable for processing records and logs, which are easy to split into small Key-Value pairs with words or lines. A typical record or log processing task includes information extraction and regrouping, which are easily expressed in Map-Reduce: intermediate Key-Value pairs are first extracted from records and logs in Map tasks, then regrouped in shuffling, and finally processed by Reduce tasks. But Hadoop is inefficient for many applications served by MPI because its control flow is constrained to Map-Shuffle-Reduce patterns. Differences in the algorithms and data characteristics of an application also influence scheduling. In many scientific applications, the workload can be evenly distributed across compute nodes and in-memory communication between processes happens frequently; as a result, MPI uses static scheduling. But for log and record processing, the workload in each task is hard to estimate, since some tasks generate more Key-Value pairs than others and all data exchange is based on HDFS (not in-memory communication). Because of this, Hadoop uses dynamic scheduling. Hadoop also provides task-level fault tolerance, while MPI does not. 3) Twister and Spark: Twister and Spark are somewhere between MPI and Hadoop. Twister provides an easy-to-use, data-centric solution to process big data in data mining and scientific applications. Twister s control flow is defined by iterations of MapReduce jobs, with the output of each iteration sent to the input to the next iteration. The data in Twister is abstracted as Key-Value pairs for intermediate data regrouping as per the needs of the application, and it uses static scheduling: data is pre-split and evenly distributed to nodes based on the available computing slots (the number of cores). Tasks are then sent to where the data is located. Spark also targets iterative algorithms but boasts flexible iteration control with separated RDD operations called transformations and actions. A RDD is an in-memory data abstraction in distributed computing with fault tolerance support. Typical operations on RDDs include MapReducelike operations such as Map, GroupByKey (like Shuffle but without sort) and ReduceByKey (same as Reduce), and also relational database-like operations like Union, Join, and Cartesian-Product. Spark scheduling is similar to Dryad but considers available memory of RDD partitions. RDD s lineage graph is examined to build a DAG of stages for late execution. The Key-Value abstraction in MapReduce also requires more work in the form of partitioning before data loading, because the data abstracted in computation is usually not organized in the same way as in the file system. For example, the data in our image clustering application is stored in a set of text files, with each file containing feature vectors generated from a set of images. The file lengths and the total number of files varies. However, in the computation we set the number of data partitions to be a multiple of the number of cores so that we can evenly distribute computation. (a) Map Only (Pleasingly Parallel) Input map Output - CAP3 Gene Analysis - Smith-Waterman Distances - Document conversion (PDF -> HTML) - Brute force searches in cryptography - Parametric sweeps - PolarGrid Matlab data analysis No Communication (b) Classic MapReduce Input map reduce - High Energy Physics (HEP) Histograms - Distributed search - Distributed sorting - Information retrieval - Calculation of Pairwise Distances for sequences (BLAST) (c) Iterative MapReduce Input iterations map - Expectation maximization algorithms - Linear Algebra - Data mining include K- means clustering - Deterministic Annealing Clustering - Multidimensional Scaling (MDS) - PageRank Collective Communication reduce (d) Loosely Synchronous Many MPI scientific applications utilizing wide variety of communication constructs including local interactions - Solving Differential Equations and - particle dynamics with short range forces Figure 3. Classification of Applications Ultimately we need to convert raw data on disks to cooked data ready for computation. Currently we split original data files into evenly-sized data partitions. But Hadoop can automatically load data from blocks with custom InputSplit or InputFormat classes, while MPI requires the user to split data or use special formats like HDF5 and NetCDF. B. Collective Communication and Intermediate Data Handling MPI researchers have made major progress on communication optimization. However MPI focuses on low latency communication, while our application is notable for large messages where latency is less relevant. With the support of high-performance hardware, communication is well optimized. Users can communicate in two ways; one is to call send/receive APIs to customize communication between processes, and another is to invoke libraries to do collective communication, which is a type of communication in which all the workers are required to participate. Often data-centric problems run on clouds which consist of commodity machines, and the cost of transferring big intermediate data is high. For example, in our image clustering application, broadcasting about 500MB is required, and our findings show that this operation can be a great burden to current data-centric technology. This makes it necessary to systematically develop a Map-Collective approach with a wide range of collectives, and to optimize for big data instead of the MPI simulation optimizations. There are traditionally 7 collective communication operations discussed in MPI [26]: four data redistribution operations (broadcast, scatter, gather, allgather) and three data consolidation operations (reduce(-to-one), reducescatter, all-reduce). Neither Hadoop, Twister, nor Spark explicitly define all these operations. Hadoop has implicit Pij MPI

5 broadcast based on distributed cache, and since Hadoop data is managed by HDFS, direct memory-to-memory collective communication does not exist. Twister and Spark support broadcast explicitly but not allgather and allreduce. Another iterative MapReduce system, Twister4Azure [27], supports all-gather and all-reduce. In a later paper we will describe integrating these different collectives into a single system that runs on HPC clusters (Twister) and PaaS cloud systems (Twister4Azure), changing the implementation to optimize for each infrastructure. The same high level collective primitive is used on each platform with different under-the-hood optimizations. In broadcasting, data abstraction and methods are very different across these systems: data is abstracted as an array buffer in MPI, as an HDFS file in Hadoop, and as an object in Twister and Spark (Key-Value pairs in Twister and arbitrary objects in Spark). Several algorithms are used for broadcasting. MST (Minimum-Spanning Tree) is typically used in MPI [26]. In this method, nodes form a minimum spanning tree and data is forwarded along the links. In this way, the number of nodes which have data grows in geometric progression, such that the performance model is: where is the number of nodes, is the data size, is the communication startup time and is the data transfer time per unit. This method is much better than simple broadcasting by reducing the complexity term to. But it is still insufficient when compared with scatter-allgather bucket algorithm. This algorithm is used in MPI for long vector broadcasting, which follows the style of divide, distribute and gather [28]. In the scatter phase, it scatters the data to all the nodes. Then in all-gather the bucket algorithm is used, which views the nodes as a chain. At each step, every node sends data to its right neighbor [31]. By taking advantage of the fact that messages traversing a link in opposite directions do not conflict, all-gather is done in parallel without any network contention. The performance model is: (2) In large data broadcasting, assuming is small, the broadcasting time is about. This is much better than the MST method because time is constant. However, it is not easy to set a global barrier between the scatter and all-gather phases in a cloud system to enable all the nodes to do allgather at the same time. As a result, some links will have more load than others and thus cause network contention. We implemented this algorithm and provide test results on IU PolarGrid (Table III). The execution time is roughly, but as the number of nodes increases, it gets slightly slower. MPI also has InfiniBand multicast-based broadcast [29]. Many clusters support hardware-based multicast, but it is not reliable: send order is not guaranteed and size of each send is limited. So after the first stage of multicasting, broadcast is enhanced with a chain-like broadcast, which is reliable TABLE III. SCATTER-ALLGATHER BUCKET ALGORITHM ON IU POLARGRID WITH 1 GB DATA BROADCASTING Node# Seconds enough to make sure every process has completed receiving data. In the second stage, the nodes are formed into a virtual ring topology. Each MPI process that gets a message via multicast serves as a new root within the virtual ring and exchanges data to its predecessor and successor in the ring. This is similar to the bucket algorithm we discussed above. Though the above methods are not perfect, they all reduce broadcast time to a great extent. Still, none of them are applied in data-centric solutions, where instead a simple algorithm is commonly used. Hadoop relies on HDFS to do broadcasting: the Distributed Cache is used to cache data from HDFS to local disks of the compute nodes. The API addcachefile and getlocalcachefiles work together to complete the process of broadcasting. There is no special optimization, and data download speed depends on the number of replicas in HDFS [25]. This method generates significant overhead (factor of ) when handling big data, as we show in our experiments. We call this a simple algorithm because it sends data to all nodes one by one. Initially in Twister, a single message broker is used to do broadcasting in a similar way (Figure 4). Multiple brokers in Twister or multiple replicas in HDFS could contain a simple 2-level broadcasting tree to ease performance issues, but this does not solve the problem. In the next section we propose a chain-based broadcasting algorithm suitable for cloud systems. Master Node Twister Driver Main Program Twister Daemon reduce map Worker Pool Local Disk Worker Node B Cacheable tasks B B B Scripts perform: Data distribution, data collection, and partition file creation Pub/Sub Broker Network and Collective Communication Service Twister Daemon Worker Pool Local Disk Worker Node Figure 4. Initial Twister architecture Meanwhile, instead of using the simple algorithm, Spark enhances broadcast speed by using BitTorrent, a well-known technology in file sharing. Spark s broadcast programming interface is very different from MPI and Twister. Due to the mechanism of late execution, broadcast is finished not in a single step but in two stages. When broadcast is invoked, the data is not broadcast until the parallel tasks are executed. Broadcasting happens when 10 printing tasks are invoked, so it doesn t execute on all the nodes, only those where tasks

6 are located. The performance of Spark Broadcasting is discussed with a simple case in Section VI. For data consolidation operations, reduce(-to-one) and reduce-scatter are parallel to a shuffle-reduce operation in data-centric solutions. Reduce-(to-one) can be viewed as shuffling with only one Reducer while reduce-scatter can be viewed as shuffling with all workers as reducers. However, these operations are fundamentally different in terms of semantics because shuffle-reduce is based on Key-Value pairs while reduce-(to-one) and reduce-scatter are based on vectors. The former is more flexible than the latter. In shuffle-reduce the number of keys in one worker can be arbitrary. For example, in WordCount, for a particular word word1, one worker could generate the pair (word1, 1) multiple times. Or there might be no such Key-Value pairs if the worker found no examples of word1. In addition, a value can be an arbitrary object and encapsulate many different data types. However, reduce-scatter requires the size of vectors for reduction to be identical in all workers. Because the number of words and counts in each worker are hard to estimate, it is difficult to replace shuffle-reduce with reducescatter in WordCount. We cannot use collective communication in MPI directly to simulate shuffle-reduce in MPI. Instead we customize the communication with send/receive calls [30], although the final MPI program is not simple and users have to explicitly designate where the data goes. By contrast, in data-centric solutions, data is managed by the framework and automatically goes to the destination according to the keys. Thus shuffling can be viewed as a unique type of collective communication in data-centric solutions, and runtimes implement it differently. Hadoop manages intermediate data on disk, so data is first partitioned, sorted and spilled to disk, then transferred, merged and sorted again at the Reducer. However, shuffling in Twister is much simpler and has better performance: data is only regrouped by keys and transferred in memory, and there is no sorting [4]. In Spark, there are two APIs related to shuffling: groupbykey and sort. We argue that sort is not a necessary part of shuffle. In Twister, all intermediate data is in memory so keys can be regrouped through a large hash map, whereas in Hadoop, merging is done on disk and thus sorting is required to put similar keys together. In many applications such as WordCount and our clustering application, it is sufficient for the data to be grouped without being sorted, as key rank is not important. As a result, we view shuffle as regroup. Due to the difference between similar concepts in different models, we generalize the abstraction of data consolidation operations in the Map-Collective model as aggregation. So a data consolidation operation such as shuffle-reduce is considered as regroup-aggregate. In our image clustering application, we implemented 3-stage aggregation. In summary, collective communication is not well studied in the context of MapReduce and other data-centric solutions, and may not be optimally implemented in current runtimes. Though collective communication operations have been used in MPI for decades, they are still missing in MapReduce despite being required by applications; for instance, broadcast and aggregate are two important operations in our image clustering application. We introduce a new Twister control flow with optimized broadcasting and local aggregation features (Figure 2). Our collectives are implemented asynchronously but the broadcast step of K- means naturally synchronizes the algorithm at each iteration. IV. BROADCAST COLLECTIVE To address the need for high performance broadcast, we replace the broker methods in Twister with a chain method based on TCP sockets, customizing the message routing. A. Chain Broadcasting Algorithm We propose the chain method, based on pipelined broadcasting [31]. Compute nodes in a Fat-Tree topology [32] are treated as a linear array and data is forwarded from one node to its neighbor chunk-by-chunk. Performance is enhanced by dividing the data into small chunks and overlapping transmission of data. For example, the first node sends a chunk to the second node. Then, while the second node sends the data to the third node, the first node sends another chunk to the second node, and so on [31]. This pipelined data forwarding is called a chain, and is particularly suitable for the large data in our communication problem. Pipelined broadcasting performance depends on the chunk size. Ideally, if every transfer can be overlapped seamlessly, the theoretical performance is as follows: (3) Here is the number of nodes, is the number of data chunks, is the data size, is communication startup time and is data transfer time per unit. In large data broadcasting, assuming is small and is large, the main term of the formula is, which is close to constant. From the formula, the best number of chunks is when [31]. However, in practice, the actual chunk size is decided by the system, and the speed of data transfers on each link could vary as network congestion might occur when data is continuously forwarded into the pipeline. As a result, formula (3) cannot be applied directly to predict real performance of our chain broadcasting implementation. But the experimental results we present later still show that as increases, the broadcasting time remains constant and close to the bandwidth limit. B. Rack-Awareness The chain method is suitable for racks with Fat-Tree topologies, which are common in clusters and data centers [33]. Since each node has only two links (less than the number of links per node in Mesh/Torus [34]), chain broadcasting can maximize the utilization of links per node. We also make the chain topology-aware by allocating nodes within the same rack nearby in the chain. Assuming the racks are numbered as,,, we put nodes in at the

7 beginning of the chain, nodes in after nodes in, nodes in after nodes in, etc. Otherwise, if the nodes in are intertwined with nodes in in the sequence, the flow will jump between switches, overburdening the core switch. To support rack-awareness, we save configuration information on each node. A node discovers its predecessor and successor by loading this information when starting. Future work could replace this with automatic topology detection. C. Implementation Our chain broadcast implementation starts with a request from the root to the first node in the topology-aware chain. Then the master keeps sending a small portion of the data to the next slave node. In the meantime, every node in the chain creates a connection to its successor. Each node receives partial data from the socket stream, stores it in the application buffer, and forwards it to the next node (details are in [30]). V. 3-STAGE AGGREGATION As discussed in Section II, direct aggregation is impossible for large intermediate data. We do aggregation in 3 stages. In the first stage of aggregation, since each Map task is running at the thread level, we reduce the intermediate data size on each node with local aggregation across Map tasks. We organize the local aggregated data into partitions. Then in the second stage, we do group-by aggregation across nodes for each partition. In the third stage, we do gather to aggregate partitions from nodes to the driver. To support local aggregation, we provide a general interface to help users define any associative commutative aggregation operation, which is the addition of two partial sums in our case. As Map tasks work at the thread level on compute nodes, we do local aggregation in the memory VI. EXPERIMENTS A. Performance Comparison among Runtime Frameworks To evaluate performance of the proposed broadcasting and aggregation mechanisms, we conducted experiments on IU PolarGrid in the context of both kernel and application benchmarking. PolarGrid cluster uses a Fat-Tree topology to connect compute nodes. These are split into sections of 42 nodes which are then tied together with 10 GigE to a Cisco Nexus core switch. For each section, nodes are connected with 1 GigE to an IBM System Networking Rack Switch G8000. This forms a 2-level Fat-Tree structure with the first level of 10 GigE connections and the second level of 1 GigE connections. Each compute node has a 4-core 8-thread Intel Xeon CPU E GHz processor, with 12 MB of L2 cache per core. Each compute node has 16 GB total memory. The results demonstrate that the chain method achieves the best performance on big data broadcasting compared with other MapReduce and MPI frameworks, and 3-stage aggregation significantly outperforms the original aggregation. 1) Broadcasting We test the following methods: Twister chain method, MPI_BCAST in Open MPI [35], and broadcast in MPJ Express 0.38 [36]. We also compare the current Twister chain broadcasting method with other designs such as simple broadcasting and chain method without topology awareness. Figure 5 shows performance of simple broadcast as a baseline on IU PolarGrid. Owing to 1 GB connection on each node, the transmission speed is about 8 seconds per GB, which matches the bandwidth setting. With our new algorithm, we reduce the cost by a factor of from to, where is the number of nodes and is data size. Figure 6 compares the chain method and MPI_BCAST method in Open MPI. The time cost of the new chain method shared by Map tasks. Once a Map is finished, it doesn t send data immediately, instead caching it to a shared memory pool. When key conflicts happen, the program invokes a user-defined operation to merge two Key-Value pairs. A barrier is set so that data in the pools is not transferred until all the Map tasks in a node are finished. By trading communication time with computation time, we reduce data transfer significantly. is stable as the number of processes increases. This matches the broadcasting formula (3) and we can conclude that with proper implementation, the actual performance of the chain method can achieve near constant execution time. Moreover, the new method achieves 20% better performance than MPI_BCAST in Open MPI. Figure 7 compares the Twister chain method and the broadcast method in MPJ. Due to exceptions, we couldn t launch MPJ broadcast on 2 GB data. MPJ broadcasting method is also stable as the number of

8 processes grows, but is four times slower than our Java implementation. Furthermore, there is a significant gap between 1-node broadcasting and 25-node broadcasting in MPJ. However if the chain sequence is randomly generated but not topology-aware, performance degrades quickly as the scale grows. Figure 9 shows that the chain method with topology-awareness is 5 times faster than without. For broadcasting within a single switch, we see that as expected, there is not much difference between the two methods. However, as the number of nodes and the number of racks increase, execution time increases significantly. When there are more than 3 switches, execution time becomes stable and does not change much; this is because there are many interswitch communications and performance is constrained by the 10 GB bandwidth and the throughput ability of the core switch. 2) Analysis of BitTorrent Broadcasting Here we examine performance of BitTorrent broadcast in Spark, which is reported to be excellent [37]. In our testing, however, the current Spark (v ) has good performance on a few nodes but degrades quickly as the number of nodes increases. We executed only 1 task after invoking broadcast on 500 MB of data, and the result was stable as the number of nodes grew. When we set the number of receivers equal to the number of nodes, performance issues emerged: on 25 nodes with 25 tasks, the performance was the same as with 1 receiver, but on 50 nodes with 50 tasks, broadcast time increased threefold. When we tried to broadcast from 75 nodes to 150 nodes, none of these executed successfully. Increasing the number of receivers to the number of cores 3) Local Aggregation in 3-stage Aggregation To reduce intermediate data from 1 TB to 125 GB, we use local aggregation. The output per node is reduced to 1 GB and total data for shuffling is only about 125 GB, plus the cost of regrouping is only 10% of the original time. B. Evaluation of Image Recognition Application Finally, we tested the full execution of our image classification application. As described in Section II, the first step of constructing our classifiers was to create the vocabulary. To do this, we took a set of 12 million publiclyavailable, geo-tagged Flickr photos, randomly sampled 5 patches from each image, and then computed the 512 dimensional HOG feature for each patch. We then used our high-performance implementation of K-means to cluster these 7.42 million 512 dimensional HOG vectors into 1 million cluster centers. Specifically, we created 10,000 map tasks on 125 compute nodes, so that each node had 80 tasks and each task cached 742 vectors. For 1 million centroids, the amount of broadcast data was about 512 MB, the aggregation data size was about 20 TB, and the data size after local aggregation was about 250 GB. Since the total amount of physical memory on our 125 nodes was 2 TB, we could not even execute the program unless local aggregation was performed first. Collective communication cost per iteration was 169 seconds (less than 3 minutes). Note that we are currently developing a faster K-means algorithm [8] [10] that will drastically reduce the current hour-long computation time in the Map stage by a factor of the dimensionality of the feature vectors, and so the improved communication time is highly relevant. gave similar results, scaling to 50 nodes only in Figure 8. We tried 1 GB and 2 GB broadcasts only to find these did not scale to 25 nodes. Since the broadcast topology in BitTorrent is built dynamically, it is unknown if it follows the patterns in MPI, such as a MST. Also important in broadcast is that this topology follows rack-awareness. A special dynamic topology detection technique is mentioned [37] but it may not be used in the current version. For sending chunk size, it mentions that 4 MB performs well but without any further analysis. Once the vocabulary was created, we trained classifiers for our problems of inferring elevation gradient and population. Our classification task is to determine if a given image was taken in an area with a high or low value for these attributes e.g. a rural area or a city. To build the classifiers, we collected approximately 15,000 geo-tagged images from Flickr which were labeled with ground-truth attribute values. We encoded these images as described in section 2.1 using the vocabulary built through K-means, which produces a single k-dimensional feature vector for each image. We then used half the data to train linear

9 Support Vector Machine classifiers [19], and reserved the rest for testing. For the elevation gradient task, our classifiers achieved an accuracy of 57.23%, versus a random baseline of 50%, while the urbanicity classifier performed better at 68.27%. In interpreting these numbers, it is important to note that we used an unfiltered set of Flickr images, including photos that do not have any visual evidence of where they were taken. To put these accuracies into context, we conducted an experiment in which we asked people to perform the same tasks (classify urbanicity and elevation gradient) on a random subset of 1000 images. They performed slightly better on elevation gradient (60.0% vs. 57.2%) but significantly better on urbanicity (80.8% vs. 68.2%). Figure 11 presents sample images, showing both correct and incorrect classifications. Figure 10 presents the relationship between the size of the vocabulary (which, in turn, is the size of the feature clustering task) and the classifier accuracy. We observe that the elevation gradient classifier quickly reached a saturation point such that the additional information encoded in the larger vocabularies is not very helpful. On the other hand, for the urbanicity attribute, the accuracy improved by a steady 2-3% for each tenfold increase in vocabulary. These results demonstrate that, in some cases, large gains in image classification accuracy can be made by employing vast dictionaries like those the proposed framework can support. VII. RELATED WORK In Section III we discussed data processing runtimes and compared the collective communication within them. Here we summarize the analysis and add other observations. Collective communication algorithms are thoroughly studied in MPI, although the Java implementations are less well optimized. Each operation has several different algorithms based on message size and network topology (such as linear array, mesh and hypercube [26]). Basic algorithms are the pipeline broadcast method [31], the minimum-spanning tree method, the bidirectional exchange algorithm, and the bucket algorithm. Since these algorithms have different advantages, combinations of these algorithms (polymorphism) are widely used to improve communication performance [26], and some solutions also provide automatic algorithm selection [38]. Other papers have a different focus than our work. Some of them study small data transfers up to a level of megabytes [24] [26] while some solutions rely on special hardware support [28]. The data type in these papers is typically vectors and arrays, whereas we are considering objects. Many algorithms such as all-gather operate under the assumption that each node has the same amount of data [26] [28], which is uncommon in a MapReduce model. There are several solutions to improve the performance of data transfers in MapReduce. Orchestra [37] is one such global control service and architecture that manages intraand inter-transfer activities in the Spark system (we gave some test results in section 3.1). It not only provides control, scheduling and monitoring on data transfers, but also optimizes broadcasting and shuffling. For broadcasting, it uses an optimized BitTorrent-like protocol called Cornet, augmented by topology detection. For shuffling, Orchestra employs weighted shuffle scheduling (WSS) to set the weight of the flow proportional to the data size; we noted earlier this optimization is not relevant in our application. Hadoop-A [39] provides a pipeline to overlap the shuffle, merge and reduce phases and uses an alternative Infiniband RDMA-based protocol to leverage RDMA inter-connects for fast shuffling. MATE-EC2 [40], a MapReduce-like framework for Amazon EC2 and S3, uses local and global aggregation for data consolidation. This strategy is similar to what was done in Twister, but since it focuses on the EC2 environment, the design and implementation are totally different. imapreduce [41] and ihadoop [42] are iterative MapReduce frameworks that optimize data transfers between iterations asynchronously when there is no barrier between iterations. However, this design does not work for applications that need to broadcast data in every iteration because all the outputs from Reduce tasks are needed for every Map task. Daytona [43] is an Azure-based iterative MapReduce runtime developed by Microsoft Research using some of the ideas of Twister with Excel DataScope as an application allowing Cloud or Excel input and output datasets. The focus of this paper is on the algorithms, system design and implementation to support large-scale computer vision, not computer vision itself. Still, we will briefly mention a few papers related to ours. A small but growing number of papers have considered the opportunities and challenges of image classification on large-scale online social photo collections. Hays and Efros [15] and Li et al [14] use millions of images to build classifiers for place and landmark recognition, respectively, while Xiao et al [12] build a huge dataset of images and test various features and classifiers on scene type recognition. The de facto standard classification technique is to extract features like HOG [18], cluster into a vocabulary using K-means [17], write each image as a histogram over the vocabulary, and then learn a classifier using an SVM [19]. We are not aware of any work that has built vocabularies on the scale that we consider in this paper. VIII. CONCLUSIONS AND FUTURE WORK In this paper, we demonstrated first steps towards a high performance Map-Collective programming model and runtime using the requirements of a large-scale clustering algorithm. We replaced broker-based methods and designed and implemented a new topology-aware chain broadcast algorithm, which reduces the time burden of broadcast by at least a factor of 120 on 125 nodes, compared with the simple broadcast algorithm. It gives 20% better performance than the best C/C++ MPI methods, 4 times better than Java MPJ, and 5 times better than non-optimized (for topology) pipeline-based method on 150 nodes. The aggregation cost after using local aggregation is only 10% of the original time. Collective communication has significantly improved the intermediate data transfer for large-scale clustering problems. In future work, we will apply the Map-Collective framework to other iterative applications, including Multi-

10 Dimensional Scaling, where the all-gather primitive is needed. We will also extend current work to include an allreduce collective that is an alternative approach to K-means. The resulting Map-Collective model that captures the full range of traditional MapReduce and MPI features will be evaluated on Azure [27] as well as IaaS/HPC environments. On the application side, we will apply our technique to classifying types of scene attributes other than urbanicity and elevation; our goal is to build classifiers for hundreds or thousands of scene attributes, and then use these for place recognition by cross-referencing to GIS maps. We are investigating other techniques like deep learning [7] for building the vocabulary, which will also apply iterative algorithms to large-scale data like the ones we have considered here. ACKNOWLEDGEMENTS We gratefully acknowledge support from National Science Foundation grant OCI and IIS , and from the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of IARPA, AFRL, or the U.S. Government. REFERENCES [1] Apache Hadoop. [2] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Symp. on OS Design and Implementation, [3] P. Dubey. A Platform 2015 Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera. Compute-Intensive, Highly Parallel Applications and Uses. (9) 2, [4] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H, Bae, J. Qiu, G. Fox. Twister: A Runtime for iterative MapReduce. In Workshop on MapReduce and its Applications at ACM HPDC 2010, [5] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. In HotCloud, [6] Y. Bu, B. Howe, M. Balazinska, and M. Ernst. Haloop: Efficient Iterative Data Processing on Large Clusters. In Proc. of the VLDB Endowment, September [7] J. Dean, et al. Large Scale Distributed Deep Networks. In Neural Information Processing Systems Conf., [8] J. Qiu, B. Zhang, Mammoth Data in the Cloud: Clustering Social Images. In Clouds, Grids and Big Data, IOS Press, [9] B. Zhang, J. Qiu, High Performance Clustering of Social Images Collective Programming Model. In Proc. of 2013 ACM Symposium on Cloud Computing. [10] Charles Elkan, Using the triangle inequality to accelerate K-means. In Intl. Conf. on Machine Learning [11] MPI Forum, MPI: A Message Passing Interface, in Proc. of Supercomputing, [12] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. CVPR, [13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminately Trained Part Based Models. PAMI [14] Y. Li, D. Crandall, and D. Huttenlocher. Landmark Classification in Large-Scale Image Collections. In ICCV, [15] J. Hays and A. Efros. IM2GPS: Estimating geographic information from a single image. In IEEE CVPR, [16] D. Crandall, L. Backstrom, J. Kleinberg, and D. Huttenlocher. Mapping the world s photos. In Intl. World Wide Web Conf., G. Csurka, C. Dance, L. Fan, J. Williamowski and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, [17] G. Csurka, C. Dance, L. Fan, J. Williamowski and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, [18] N. Dalal, B. Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE CVPR, 2005 [19] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods, MIT Press, [20] J. Ekanayake, T. Gunarathne, J. Qiu, G. Fox, S. Beason, J. Choi, Y. Ruan, S. Bae, and H. Li. Applicability of DryadLINQ to Scientific Applications. Community Grids Lab, Indiana University, [21] S. Plimpton, K. Devine, MapReduce in MPI for Large-scale Graph Algorithms, Parallel Computing, [22] X. Lu, B. Wang, L. Zha, and Z. Xu. Can MPI Benefit Hadoop and MapReduce Applications? Intl. Conf. on Parallel Processing Workshops, [23] T. Hoefler, A. Lumsdaine, and J. Dongarra. Towards Efficient MapReduce Using MPI. In PVM/MPI Users' Group Meeting, [24] P. Balaji, A. Chan, R. Thakur, W. Gropp, and E. Lusk. Toward message passing for a million processes: Characterizing MPI on a massive scale Blue Gene/P. Computer Science Research and Development, (24), pp , [25] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. Symp. on Mass Storage Systems and Technologies, [26] E. Chan, M. Heimlich, A. Purkayastha, and R. A. van de Geijn. Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience (19), 2007,. [27] T. Gunarathne, B. Zhang, T.-L. Wu, and J. Qiu. Scalable Parallel Computing on Clouds Using Twister4Azure Iterative MapReduce. Future Generation Computer Systems (29), pp , [28] N. Jain and Y. Sabharwal. Optimal Bucket Algorithms for Large MPI Collectives on Torus Interconnects. ACM Intl. Conf. on Supercomputing, [29] T. Hoefler, C. Siebert, and W. Rehm. Infiniband Multicast: A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast. IEEE Intl. Parallel & Distributed Processing Symposium, 2007 [30] Bingjing Zhang and Judy Qiu, High Performance Clustering of Social Images in a Map-Collective Programming Model. Technical Report, July 3, 2013 [31] J. Watts and R. van de Geijn. A pipelined broadcast for multidimensional meshes. Parallel Processing Letters (5) [32] C. Leiserson. Fat-trees: universal networks for hardware efficient supercomputing. IEEE Transactions on Computers, (34) 10, [33] R. N. Mysore. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In SIGCOMM, 2009 [34] S. Kumar, Y. Sabharwal, R. Garg, and P. Heidelberger. Optimization of All-to-all Communication on the Blue Gene/L Supercomputer. In Intl. Conf. on Parallel Processing, [35] Open MPI, [36] MPJ Express, [37] M. Chowdhury et al. Managing Data Transfers in Computer Clusters with Orchestra, ACM SIGCOMM, [38] H. Mamadou, T. Nanri, and K. Murakami. A Robust Dynamic Optimization for MPI AlltoAll Operation. Intl. Symp. on Parallel & Distributed Processing, [39] Y. Wang et al. Hadoop Acceleration Through Network Levitated Merge. In Intl. Conf. for High Performance Computing, Networking, Storage and Analysis, [40] T. Bicer, D. Chiu, and G. Agrawal. MATE-EC2: A Middleware for Processing Data with AWS. In ACM Intl. Workshop on Many Task Computing on Grids and Supercomputers, [41] Y. Zhang, Q. Gao, L. Gao, and C. Wang. imapreduce: A distributed computing framework for iterative computation. In DataCloud, [42] E. Elnikety, T. Elsayed, and H. Ramadan. ihadoop: Asynchronous Iterations for MapReduce. In IEEE Intl. Conf. on Cloud Computing Technology and Science, [43] Daytona.

Large-Scale Image Classification using High Performance Clustering

Large-Scale Image Classification using High Performance Clustering Large-Scale Image Classification using High Performance Clustering Bingjing Zhang, Judy Qiu, Stefan Lee, David Crandall Department of Computer Science and Informatics Indiana University, Bloomington Abstract.

More information

Twister4Azure: Data Analytics in the Cloud

Twister4Azure: Data Analytics in the Cloud Twister4Azure: Data Analytics in the Cloud Thilina Gunarathne, Xiaoming Gao and Judy Qiu, Indiana University Genome-scale data provided by next generation sequencing (NGS) has made it possible to identify

More information

HPC ABDS: The Case for an Integrating Apache Big Data Stack

HPC ABDS: The Case for an Integrating Apache Big Data Stack HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Big Data and Scripting Systems beyond Hadoop

Big Data and Scripting Systems beyond Hadoop Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Using Cloud Technologies for Bioinformatics Applications

Using Cloud Technologies for Bioinformatics Applications Using Cloud Technologies for Bioinformatics Applications MTAGS Workshop SC09 Portland Oregon November 16 2009 Judy Qiu xqiu@indiana.edu http://salsaweb/salsa Community Grids Laboratory Pervasive Technology

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Hadoop Cluster Applications

Hadoop Cluster Applications Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark. Fast, Interactive, Language- Integrated Cluster Computing Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu

Distributed Aggregation in Cloud Databases. By: Aparna Tiwari tiwaria@umail.iu.edu Distributed Aggregation in Cloud Databases By: Aparna Tiwari tiwaria@umail.iu.edu ABSTRACT Data intensive applications rely heavily on aggregation functions for extraction of data according to user requirements.

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

A Collective Communication Layer for the Software Stack of Big Data Analytics (Thesis Proposal)

A Collective Communication Layer for the Software Stack of Big Data Analytics (Thesis Proposal) A Collective Communication Layer for the Software Stack of Big Data Analytics (Thesis Proposal) Bingjing Zhang School of Informatics and Computing Indiana University Bloomington December 8, 2015 Table

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

CloudClustering: Toward an iterative data processing pattern on the cloud

CloudClustering: Toward an iterative data processing pattern on the cloud CloudClustering: Toward an iterative data processing pattern on the cloud Ankur Dave University of California, Berkeley Berkeley, California, USA ankurd@eecs.berkeley.edu Wei Lu, Jared Jackson, Roger Barga

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Portable Data Mining on Azure and HPC Platform

Portable Data Mining on Azure and HPC Platform Portable Data Mining on Azure and HPC Platform Judy Qiu, Thilina Gunarathne, Bingjing Zhang, Xiaoming Gao, Fei Teng SALSA HPC Group http://salsahpc.indiana.edu School of Informatics and Computing Indiana

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

Processing Large Amounts of Images on Hadoop with OpenCV

Processing Large Amounts of Images on Hadoop with OpenCV Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity

Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Noname manuscript No. (will be inserted by the editor) Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Aysan Rasooli Douglas G. Down Received: date / Accepted: date Abstract Hadoop

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

GeoGrid Project and Experiences with Hadoop

GeoGrid Project and Experiences with Hadoop GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Big Data in Test and Evaluation by Udaya Ranawake (HPCMP PETTT/Engility Corporation)

Big Data in Test and Evaluation by Udaya Ranawake (HPCMP PETTT/Engility Corporation) Big Data in Test and Evaluation by Udaya Ranawake (HPCMP PETTT/Engility Corporation) Approved for Public Release. Distribution Unlimited. Data Intensive Applications in T&E Win-T at ATC Automotive Data

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

1. Comments on reviews a. Need to avoid just summarizing web page asks you for:

1. Comments on reviews a. Need to avoid just summarizing web page asks you for: 1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

Cloud computing doesn t yet have a

Cloud computing doesn t yet have a The Case for Cloud Computing Robert L. Grossman University of Illinois at Chicago and Open Data Group To understand clouds and cloud computing, we must first understand the two different types of clouds.

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Networking in the Hadoop Cluster

Networking in the Hadoop Cluster Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

More information

CSE-E5430 Scalable Cloud Computing Lecture 11

CSE-E5430 Scalable Cloud Computing Lecture 11 CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 30.11-2015 1/24 Distributed Coordination Systems Consensus

More information

Big Data Analytics Hadoop and Spark

Big Data Analytics Hadoop and Spark Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof. CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensie Computing Uniersity of Florida, CISE Department Prof. Daisy Zhe Wang Map/Reduce: Simplified Data Processing on Large Clusters Parallel/Distributed

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

MATE-EC2: A Middleware for Processing Data with AWS

MATE-EC2: A Middleware for Processing Data with AWS MATE-EC2: A Middleware for Processing Data with AWS Tekin Bicer Department of Computer Science and Engineering Ohio State University bicer@cse.ohio-state.edu David Chiu School of Engineering and Computer

More information

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)

More information