Tech Report TR-WP3-6-2.9.2013 Analyzing Virtualized Datacenter Hadoop Deployments Version 1.0

Longitudinal Analytics of Web Archive data

European Commission Seventh Framework Programme
Call: FP7-ICT , Activity: ICT
Contract No:

Editor: Aviad Pines
Work Package: WP3
Status: Final
Date:
Dissemination Level: Public

Project Overview

Project Name: LAWA Longitudinal Analytics of Web Archive data
Call Identifier: FP7-ICT
Activity Code: ICT
Contract No:
Partners:
1. Coordinator: Max-Planck-Institut für Informatik (MPG), Germany
2. Hebrew University of Jerusalem (HUJI), Israel
3. European Archive Foundation (EA), Netherlands
4. Hungarian Academy of Sciences (MTA-SZTAKI), Hungary
5. Hanzo Archives Limited (HANZO), United Kingdom
6. University of Patras (UP), Greece

Document Control

Title: Analyzing Virtualized Datacenter Hadoop Deployments
Author/Editor: Aviad Pines

Document History

Version:
Date:
Author/editor: Aviad Pines
Description/comments: Master thesis

Contents

Abstract
Introduction
Related Work and Anticipated Impact
Methodology
Amazon Deployments
Elastichosts Deployments
A Mathematical Model of Computation Time
Map Phase Tipping Point
Summary
Acknowledgements
References

Abstract

This paper discusses the performance of Hadoop deployments on virtualized data centers such as Amazon EC2 and Elastichosts, both when the Hadoop cluster is located in a single data center and when it is spread across data centers. We analyze the impact of the bandwidth between nodes on cluster performance.

Introduction

MapReduce is a programming model for processing and generating large data sets. It is in common use in the commercial sector, as well as among researchers, for processing large quantities of data. Among the users of MapReduce one can find Amazon, Facebook and Yahoo [1] [2]. The typical Hadoop deployment takes place in a server room or, for large deployments, in a data center.

In this paper, we aimed to analyze the performance of Hadoop deployments of various sizes in the virtualized data centers provided by two cloud server providers: Amazon [3] and Elastichosts [4]. In addition to investigating Hadoop performance in the cloud, we have deployed Hadoop between two of Elastichosts' data centers and have investigated Hadoop performance across the Internet.

It is well known that bandwidth differs between the server, local rack and cluster levels [5]. We initially expected an order of magnitude drop in bandwidth between a single-datacenter and an inter-datacenter Hadoop deployment, both because bandwidth tends to drop by an order of magnitude with each step in distance and because of our experience with Internet speeds. We also expected that Hadoop would have difficulty coping with the larger latency introduced by an Internet link. While we did observe the expected drop in bandwidth, we were surprised to discover that Hadoop does run across the open Internet (at least between data centers).

The ability to run Hadoop deployments over multiple datacenters has an obvious advantage. Companies can utilize a number of their datacenters in order to run large jobs without being confined to a single location, thus reducing the amount of time it takes to complete a job. In addition, data collection can occur in separate locations across the company's geography. Enabling computation across the distinct locations avoids the need to move the data, which would cost the company more money for the extra storage.

After analyzing both deployments (Amazon and Elastichosts) we have used the data to create a mathematical model of a Hadoop job's execution time. We have used this model to investigate the effects of various variables on the job execution time.

Related Work and Anticipated Impact

Previous work has shown that MapReduce offers a flexible and effective tool for data processing, proving to be efficient, fault tolerant and highly scalable [6] [2]. Since Hadoop's implementation favors a homogeneous cluster structure, and deploying it otherwise may impede performance, work has been carried out on improving performance in scenarios where the cluster is heterogeneous [7]. Performance studies of Hadoop have also been conducted, in both regular and virtualized environments [8] [9], but they seem to focus mostly on total job running times and on the efficient scheduling of jobs and tasks, and treat the underlying network performance as uninteresting. A more recent work studies the effects of network traffic within the confines of a small, 5-node Hadoop cluster [10].

Aside from Hadoop performance, schedulers and protocols for network traffic within datacenters have been designed to improve networking performance, and studies have been carried out to characterize the network traffic of large data centers [11] [12] [13]. Additionally, work has been done to investigate the advantages of the shift from a monolithic, large, single-site datacenter to smaller, more distributed clouds [14].

It is our aim to expand upon this base of knowledge by contributing measurements of the network traffic generated by Hadoop and by providing insight into how network performance potentially affects Hadoop performance. We also aim to show that deploying Hadoop across the Internet is a real possibility.

Methodology

Hadoop itself provides various metrics about its performance: each task logs its start and end times, the amount of data it has consumed and produced, and the amount of time that it ran. This is useful but insufficient, since the bandwidth and the idle time of the network are not measured. In order to obtain more detailed metrics we have implemented a Statistics Server, an additional Hadoop component that serves as a sink for traffic reports. Traffic reports are small descriptors sent from each node in the cluster which detail network transfers by their starting timestamp, duration, size, and type (HDFS Reads, Reducer Inputs or HDFS Writes). By using these traffic reports we were able to measure the bandwidth at various parts of the job, the typical size of a transfer, and the total time the network was idle of Hadoop traffic. Care was taken to keep the volume of the report traffic small so as not to affect the actual running time of Hadoop jobs on the modified cluster.

The Statistics Server does not let us see management traffic, e.g. traffic sent from the Name Node in order to control the execution flow of the job. Since the management traffic is negligible when compared to the rest of the traffic of the Hadoop job, we assume that it has no significant effect on the cluster performance.

Our job of choice was a simple word count benchmark run on the Gutenberg dataset, which is a collection of about 23,000 free books in plain text format (totaling about 10GB of text) and is available from Project Gutenberg [15].

Our virtualized clusters were of two types. The Amazon EC2 cluster used m1.large instances, which provide 2 computational cores and 7.5GB of RAM. On Elastichosts we specified machines with a single core, 2GB of RAM and a clock setting of 2,000 MHz. In both cases, however, we do not really know which virtualization setup our cluster actually runs on, the load of other VMs on the physical host, or the general networking load in the data center. This is the nature of the cloud environment.

For our cluster configuration, we have specified that each machine will have a number of map and reduce slots identical to the number of virtualized cores that it is configured with.
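The way traffic reports can be turned into the metrics mentioned above (network busy time, idle fraction, mean bandwidth) can be sketched as follows. This is only an illustration under stated assumptions: the `TrafficReport` class, its field names and the helper functions are hypothetical and are not the actual Statistics Server code; only the four report fields (starting timestamp, duration, size, type) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficReport:
    """One network transfer; field names are illustrative assumptions."""
    start_ms: int      # starting timestamp of the transfer
    duration_ms: int   # how long the transfer took
    size_bytes: int    # amount of data moved
    kind: str          # "HDFS_READ", "REDUCER_INPUT" or "HDFS_WRITE"


def busy_time_ms(reports):
    """Time during which at least one transfer was active (union of intervals)."""
    intervals = sorted((r.start_ms, r.start_ms + r.duration_ms) for r in reports)
    busy = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # A gap: close the previous merged interval and open a new one.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy


def idle_fraction(reports, job_start_ms, job_end_ms):
    """Share of the job's lifetime with no reported transfer in flight."""
    return 1.0 - busy_time_ms(reports) / (job_end_ms - job_start_ms)


def mean_bandwidth(reports):
    """Mean per-transfer bandwidth in bytes per millisecond."""
    rates = [r.size_bytes / r.duration_ms for r in reports if r.duration_ms > 0]
    return sum(rates) / len(rates) if rates else 0.0


# Made-up reports: two overlapping transfers and one isolated transfer.
reports = [
    TrafficReport(0, 100, 1000, "HDFS_READ"),
    TrafficReport(50, 100, 3000, "REDUCER_INPUT"),
    TrafficReport(400, 100, 500, "HDFS_WRITE"),
]
print(busy_time_ms(reports), idle_fraction(reports, 0, 1000), mean_bandwidth(reports))
```

Merging overlapping intervals before summing matters here: adding up raw durations would count concurrent transfers twice and understate how long the network sat idle.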

When calculating our networking metrics we have taken care to omit any transfer which has its source and destination on the same machine. Such transfers are handled by the OS and hypervisor protocol stacks and do not use any actual networking resources in the data center itself. As such, they do not affect the network load and the link speed does not affect them.

Amazon Deployments

First, we deployed our modified Hadoop onto Amazon EC2. We used EC2 VMs for this rather than Elastic MapReduce, because we wanted to manage our machines ourselves and to use a custom Hadoop version which included our support for statistics gathering. Elastic MapReduce allows users to execute MapReduce jobs without tinkering with the cluster configuration or the management of individual hosts. The cluster size ranged from 8 to 128 machines with two cores each.

Overall, we wanted to see whether network performance affects the running time of Hadoop jobs and whether running Hadoop over the Internet is possible. At first we wanted to see whether the number of machines involved in the computation would affect the performance of the data center network. The following graph illustrates the effect that the Hadoop cluster size has on the average bandwidth of a network path between two machines in the virtualized datacenter.

Figure 1: Mean bandwidth per number of hosts on Amazon EC2

Figure 2: Running time per number of cores on Amazon EC2

Figure 3: Cumulative transfers for Gutenberg job on Amazon EC2, 64 hosts (128 cores)

It is rather interesting to note that there are very few transfers of map inputs. Indeed, the number of non-local map tasks is extremely small. This is due to the uniform distribution of input blocks on machines (giving each machine the same amount of local input) and the homogeneity of machines across the Hadoop cluster (giving each machine similar processing power to crunch through its local maps). The bulk of transfers occur in the shuffle phase, where map outputs get transferred to their reducers.

Elastichosts Deployments

Our second set of experiments centered on benchmarking Hadoop in an inter-datacenter deployment. We have deployed clusters of 4 and 16 slaves across two Elastichosts datacenters: one located in San Antonio, Texas, and the other in Los Angeles, California. In both cases, half the machines were located in each of the data centers. Since each machine sends to all the other machines, and half of all machines sit in the remote data center, this means that in the 4-machine job approximately 66% of the traffic was sent through the cross-datacenter link, while in the 16-machine job approximately 53% of the traffic was sent through it. In both cases, more than half of the traffic was cross-datacenter.
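The two percentages above follow from simple counting. The sketch below reproduces them under the assumption (ours, for illustration) that each machine spreads its traffic evenly over all other machines in the cluster; the function name is hypothetical.

```python
from fractions import Fraction


def cross_datacenter_share(machines: int) -> Fraction:
    """Share of one machine's traffic that crosses the inter-datacenter link.

    Assumes the machines are split evenly between two data centers and that
    traffic is spread uniformly over all other machines.
    """
    if machines < 2 or machines % 2:
        raise ValueError("expected an even number of machines, at least 2")
    remote_peers = machines // 2   # every machine in the other data center
    all_peers = machines - 1       # every machine except the sender itself
    return Fraction(remote_peers, all_peers)


for n in (4, 16):
    share = cross_datacenter_share(n)
    print(f"{n} machines: {share} = {float(share):.1%}")
# 4 machines: 2/3 = 66.7%
# 16 machines: 8/15 = 53.3%
```

As the cluster grows, the share approaches one half from above, which is why the cross-datacenter link always carries slightly more than half of the traffic in an evenly split deployment.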

Figure 4: In the 4-machine deployment, hadoop1 is sending on the intra-datacenter link to hadoop2, and on the inter-datacenter link to hadoop3 and hadoop4, meaning that approximately 66% of its traffic is sent through the inter-datacenter link.

The RTT for packets between the two data centers was about 45ms on average, gathered from 5000 pings that were sent between the data centers.

As expected, there is a significant difference between the bandwidth of the two links. An order of magnitude difference was measured between machines in different data centers and machines located in the same one. However, the following graph shows us that there is no actual effect on the running time.

Figure 5: Running time per number of cores on Amazon EC2 and Elastichosts

This is a non-trivial result: the fact that the running time is similar in the 16-core case (with Elastichosts being slightly faster) is not what we expected to see. Network performance between two hosts in different data centers is comparable to the download speeds of a consumer Internet connection, and we expected this to affect computation time, but this was not the case.

Upon examining the actual busy times of the network paths between cluster nodes, we have discovered that network transfers only occurred during, on average, 3% of the computation time on Amazon and 6.3% of the computation time on Elastichosts, with a variance of and respectively. This leaves the network idle for the vast majority of the job's runtime and explains this result.

This has led us to try to devise a mathematical model for the computation time of a Hadoop job that would be able to explain this and would enable us to explore the effects of bandwidth on computation time.

A Mathematical Model of Computation Time

Our mathematical model works as follows. We separate the model into two parts: the mapping and shuffle phase, and the reduce phase.

The mapping time falls into one of two scenarios: either the map tasks take longer than the shuffle phase, or vice versa. Assuming the former, the mapping time will be the time to complete all the map tasks plus the time of a single shuffle for the last map output. On the other hand, assuming that the shuffle phase takes longer than the map phase, the time will be the first map task time (since the actual copy of data in the shuffle will only start after it), plus the time that it takes to shuffle all the map tasks in parallel to the reducers. We will take the mapping time as the maximum of the two cases.

Let's analyze the first case, where the map tasks take longer than the shuffle. Looking at the mapping phase, and assuming equal distribution of input between all the mappers, the map tasks will take $\frac{M}{N} m$, where $m$ is the average map execution time, $M$ is the total number of map tasks and $N$ is the number of cores.

Now let's look at a single shuffle time. It is the average output of the mapper divided by the number of hosts (all the calculations are done in parallel), and divided by the speed of the network link. Since the cross data center transfer will always be the bottleneck in terms of the network speed, we will use it to see the worst case scenario. Denoting $O_m$ as the average output per mapper and $l_c$ as the cross data center link speed, we get $\frac{O_m}{N} \cdot \frac{1}{l_c}$ for a single shuffle time.

Proceeding to the second case, where the shuffle phase takes longer than the map tasks, we have the first map task plus all the shuffle time, which is the single shuffle time calculated above multiplied by the number of map tasks that will have to be shuffled to the reducers. We will get the following for the second case of the map phase:

$$m + \frac{O_m}{N} \cdot \frac{1}{l_c} \cdot \frac{M}{N} \tag{1}$$

And the map phase will be the maximum of the two cases:

$$\max\left( \frac{M}{N} m + \frac{O_m}{N} \cdot \frac{1}{l_c},\;\; m + \frac{O_m}{N} \cdot \frac{1}{l_c} \cdot \frac{M}{N} \right) \tag{2}$$
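The map-phase estimate can be written down directly as code. The following is a minimal sketch of equations (1) and (2), not the authors' implementation; the function and parameter names are our own, the input values are made up for illustration, and all quantities are assumed to be in consistent units (here seconds, bytes and bytes per second).

```python
def single_shuffle_time(avg_output, slots, link_speed):
    """Time to shuffle one mapper's output: (O_m / N) * (1 / l_c)."""
    return avg_output / slots / link_speed


def map_phase_time(map_tasks, slots, avg_map_time, avg_output, link_speed):
    """Map-phase estimate: the larger of the map-bound and shuffle-bound cases."""
    shuffle = single_shuffle_time(avg_output, slots, link_speed)
    waves = map_tasks / slots  # M / N, map tasks per slot
    # Case 1: maps dominate; all map waves plus one trailing shuffle.
    map_bound = waves * avg_map_time + shuffle
    # Case 2: shuffle dominates; first map, then all shuffles (equation 1).
    shuffle_bound = avg_map_time + shuffle * waves
    return max(map_bound, shuffle_bound)


# Hypothetical inputs: 160 map tasks on 16 slots, 30 s per map,
# 16 MB of output per mapper, 1 MB/s cross-datacenter link.
print(map_phase_time(160, 16, 30.0, 16e6, 1e6))  # 301.0 (map-bound case wins)
```

With these made-up numbers the map-bound case gives 301 seconds and the shuffle-bound case only 40 seconds, so the slow link contributes a single second to the estimate.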

The reduce phase is the number of reducers we have per host times the reducer execution time. Denoting $R$ as the total number of reduce tasks and $r_{\max}$ as the maximum reducer execution time, we get the following for the reduce phase:

$$\frac{R}{N} r_{\max} \tag{3}$$

The reason that we are taking the maximum reducer execution time and not the average is that, unlike the mapping phase, where we can assume that Hadoop will uniformly distribute the map tasks among the mappers, the reducers accept keys. A single key cannot be distributed among several reducers, making the largest key the bottleneck for the reducer tasks.

This makes our complete model of computation:

$$\max\left( \frac{M}{N} m + \frac{O_m}{N} \cdot \frac{1}{l_c},\;\; m + \frac{O_m}{N} \cdot \frac{1}{l_c} \cdot \frac{M}{N} \right) + \frac{R}{N} r_{\max} \tag{4}$$

where
M = Number of map tasks.
N = Number of slots.
m = Average map execution time.
O_m = Average output per mapper.
l_c = Cross data center link speed.
R = Number of reduce tasks.
r_max = Maximum reducer execution time.

How accurate is our model? We have used the Hadoop metrics to extract the inputs to our equation from some of the jobs we have executed. The results are illustrated in the following table:

[Table columns: Total Map/Reduce Slots; Estimated Running Time (minutes); Actual Running Time (minutes); Error. Rows cover the Elastichosts and Amazon deployments.]

Figure 6: Comparing mathematical model results with actual running times

The data collected for each of the variables of the equation were:

[Table columns: Number of Cores; Average Map Execution Time (Milliseconds); Average Mapper Output (Bytes); Link Speed (K/sec); Total Number of Reduce Tasks; Total Number of Map Tasks; Max Reducer Execution Time (Milliseconds).]

Figure 7: Data used to estimate the mathematical model results on Amazon EC2

[Table columns: Number of Cores; Average Map Execution Time (Milliseconds); Average Mapper Output (Bytes); Link Speed (K/sec); Total Number of Reduce Tasks; Total Number of Map Tasks; Max Reducer Execution Time (Milliseconds).]

Figure 8: Data used to estimate the mathematical model results on Elastichosts

As we can see, our model, while not perfect, does follow the general behavior of the Hadoop clusters. This means that we can explore the effects of the input variables on the (estimated) running times.

The most significant difference between the Amazon cluster and the Elastichosts one is that in the latter we have a cross datacenter network link which is significantly slower than the inner datacenter one. Amazon's bandwidth ranges from 4000 to 7000 kb/sec. Elastichosts' inner datacenter bandwidth ranges from 6000 to 7000 kb/sec, and the cross datacenter bandwidth is an order of magnitude slower: it ranges from 560 to 810 kb/sec.

[Table columns: Total Map/Reduce slots; Amazon mean bandwidth; Elastichosts in-datacenter bandwidth; Elastichosts cross-datacenter bandwidth.]

Figure 9: Comparison of mean bandwidth between Amazon EC2 and Elastichosts

In our formula we choose the slowest link, since it will be the one creating the network bottleneck. We tested different parameters to see how the link bandwidth affects the overall job length.

[Table columns: Link bandwidth (K/sec); Job Length (Milliseconds); Job Length (Hours).]

Figure 10: Using the mathematical model to compare the effects of the link strength on the job length

We can see that the last significant effect was between 6 K/sec and 7 K/sec, and that afterwards the changes amount to an improvement of only a few seconds. It seems that network bandwidth is not significant at all for the job length: if you have more than a 7 K/sec connection (which is considered slow today even for the average home connection, let alone datacenter link speeds), upgrading the link bandwidth will not help shorten the job. This discovery is not surprising when taking into consideration the time the network was idle, which, as mentioned before, was 97% of the job lifetime on Amazon and 93.7% on Elastichosts.

Instead of changing the network link, we can increase the mapper output. This will increase the amount of data we need to transfer and is equivalent to changing the network speed. In our jobs the mapper output was bytes. We increase the mapper output in our model to see how it would affect the job length.

[Table columns: Mapper Output Size (Bytes); Job Length (Milliseconds); Job Length (Hours). Only the sub-hour part of the last column survives; in order of increasing mapper output it reads: 42m 40s 660ms (five rows), 42m 40s 662ms, 42m 40s 678ms, 42m 40s 841ms, 42m 42s 467ms, 49m 25s 513ms.]

Figure 11: Using the mathematical model to see the effects of mapper output on job length

We can see that increasing the mapper output size by seven orders of magnitude increases the job length by less than a second, and only when increasing it by eight orders of magnitude can one see a difference of more than a couple of minutes. This means that even if we have a job that creates a vast amount of data, it will not affect the overall job length.

In real jobs, however, we expect that increasing the output bytes of the mapper will increase the running time of the reducers, since more input is being sent to them. Under the assumption that the increased output has the same distribution as before, we can conclude that each reducer will receive an amount of input that grows in proportion to the increased mapper output. Since the running time is linearly proportional to the input data size [7], we can assume that the running time of each reducer will increase by the same factor as the mapper output.

[Table columns: Mapper Output Size (Bytes); Reducer Running Time (ms); Job Length (Milliseconds); Job Length (Hours). Only the sub-hour part of the last column survives; in order of increasing mapper output it reads: 39m 59s 662ms, 39m 59s 676ms, 39m 59s 821ms, 40m 1s 270ms, 40m 15s 760ms, 42m 40s 662ms, 6m 49s 678ms, 8m 19s 841ms, 23m 21s 467ms, 0m 4s 513ms.]

Figure 12: Using the mathematical model to see the effects of mapper output and reducer running time on the job length

We can see that the increase in reducer time increases the overall job length, and that increasing the data tenfold increases the overall running time by less than a factor of two.

Map Phase Tipping Point

The next thing that we wanted to explore is when the tipping point between the mapper time and the shuffle time in the map phase occurs. Looking again at the mathematical model, the map phase consists of

$$\max\left( \frac{M}{N} m + \frac{O_m}{N} \cdot \frac{1}{l_c},\;\; m + \frac{O_m}{N} \cdot \frac{1}{l_c} \cdot \frac{M}{N} \right) \tag{5}$$

We have modified our Hadoop job to generate additional output from each mapper, and examined how this affected the running time of the mapper. The original average output bytes for all jobs was

[Table columns: Mapper Average Multiplier; Mapper Average Output Size (Bytes); Mapper Average Length (Milliseconds).]

Figure 13: Effect of mapper output on job length

We have then extrapolated the ratio between the output multiplication factor and the running time increase that would be required to produce it.

Figure 14: Estimating the running time growth as a function of the mapper output size
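To make the two branches of the map-phase expression concrete, the sketch below evaluates the full job-time estimate (the map-phase maximum plus the reduce-phase term $\frac{R}{N} r_{\max}$ of equation (3)) and searches for the smallest output multiplier at which the shuffle-bound branch overtakes the map-bound branch. It is an illustration under stated assumptions rather than the authors' code: all names are ours, every number is made up, and the map-time function passed in is hypothetical, since the fitted function from Figure 14 is not reproduced here.

```python
def job_time(M, N, m, O_m, l_c, R, r_max):
    """Estimated job time: map/shuffle maximum plus the reduce-phase term."""
    shuffle = O_m / N / l_c                  # single shuffle time
    map_bound = M / N * m + shuffle          # maps dominate
    shuffle_bound = m + shuffle * M / N      # shuffle dominates
    return max(map_bound, shuffle_bound) + R / N * r_max


def find_tipping_multiplier(M, N, base_output, l_c, map_time_for, multipliers):
    """Return the first multiplier i at which the shuffle-bound case exceeds
    the map-bound case, or None if that never happens for the given values.

    map_time_for(i) must return the average map time when the mapper output
    is scaled by i; it is a caller-supplied (hypothetical) function.
    """
    for i in multipliers:
        m = map_time_for(i)
        shuffle = base_output * i / N / l_c
        if m + shuffle * M / N > M / N * m + shuffle:
            return i
    return None


# Made-up cluster: 160 maps, 16 slots, 1 MB base output, 1 MB/s link.
powers_of_ten = [10 ** e for e in range(7)]  # 1 .. 1,000,000

# Map time growing with output (hypothetical linear fit, in seconds).
growing = find_tipping_multiplier(160, 16, 1e6, 1e6,
                                  lambda i: 20.0 + 0.5 * i, powers_of_ten)
# Map time unaffected by output (hypothetical constant, in seconds).
constant = find_tipping_multiplier(160, 16, 1e6, 1e6,
                                   lambda i: 20.0, powers_of_ten)
print(growing, constant)  # None 1000
print(job_time(160, 16, 30.0, 16e6, 1e6, 16, 120.0))  # 421.0
```

For more than one map task per slot, the comparison inside the loop reduces algebraically to "single shuffle time greater than average map time". So if the map time grows with the output at least as fast as the shuffle time does, as in the first made-up case, no multiplier ever tips the balance.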

The map average time equals ( i) , where i is the multiplication factor of the mapper average output size. Next, we plugged this function into our mathematical model and iterated over different multiplication factors in order to find whether such a tipping point exists and, if it does, when it occurs. We have iterated from 1 (no multiplication) to 1,000,000 and no such tipping point occurred. This brought us to the conclusion that for a job of a nature such as word count there is no tipping point, i.e. in the map phase the mapper average time will always be larger than the shuffle time.

Summary

In this paper we aimed to investigate the deployment of a Hadoop cluster in a virtualized data center environment. We began by developing a tool that augments Hadoop's counter metric gathering system in order to gather additional statistics that Hadoop itself does not collect. Using this system, we have been able to evaluate the network performance of a Hadoop deployment on Amazon and gain insight into the amount of traffic Hadoop generates and the network resources provided by Amazon EC2.

Following this first step, we discovered that Hadoop's network utilization is low and decided to try to deploy Hadoop between two data centers connected by the public Internet. This deployment took place on Elastichosts, and we can now conclude that unmodified Hadoop code can indeed be deployed in such a cluster.

With the measurements gained from both deployments, we then created a mathematical model which allows for the approximation of a Hadoop job's runtime. Using this model, we were able to conclude that Hadoop jobs are not, typically, network constrained, and that even a very large (for example, an order of magnitude) increase in mapper output (which is the most significant source of data transfers in a Hadoop job) does not affect running time significantly (or if it does, then the increase of the reducer and mapper running times is still the dominating factor). This is likely our most interesting result, which opens the way to additional research into Hadoop deployments across the Internet and other research testbeds.

Acknowledgements

This work has been supported by the LAWA project, an EC collaborative research project (number ) on Longitudinal Analytics of Web Archive Data which is a part of the FIRE ("Future Internet Research and Experimentation") portfolio of ICT research supported by the EC.

References

[1] The Apache Software Foundation, "Who Uses Apache Hadoop," [Online]. Available:
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in MapReduce: Simplified Data Processing on Large Clusters, ACM,
[3] Amazon, "Amazon Elastic Compute Cloud (Amazon EC2)," [Online]. Available:
[4] ElasticHosts, "Elastichosts Cloud servers," [Online]. Available:
[5] M. D. Hill, The Datacenter as a Computer, Morgan & Claypool,
[6] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," vol. 53, no. 1,
[7] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors, A. Manzanares and X. Qin, Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, Department of Computer Science and Software Engineering, Auburn University, Auburn, AL.
[8] M. K. Horacio González-Vélez, "Performance evaluation of MapReduce using full virtualisation on a departmental cloud," International Journal of Applied Mathematics and Computer Science, vol. 21, no. 2, pp , June 2011.

[9] K. Kambatla, A. Pathak and H. Pucha, "Towards optimizing Hadoop provisioning in the cloud," in Proc. of the First Workshop on Hot Topics in Cloud Computing,
[10] N. B. Rizvandi, J. Taheri, R. Moraveji and A. Y. Zomaya, "Network Load Analysis and Provisioning of MapReduce Applications,"
[11] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel and S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," vol. 54, no. 3,
[12] S. Kandula, S. Sengupta, A. Greenberg, P. Patel and R. Chaiken, "The nature of data center traffic: measurements & analysis," in Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, New York, NY,
[13] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang and A. Vahdat, "Hedera: dynamic flow scheduling for data center networks," in NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation, USENIX Association Berkeley, CA, USA,
[14] K. Church, A. Greenberg and J. Hamilton, "On delivering embarrassingly distributed cloud services,"
[15] Project Gutenberg, "Project Gutenberg," [Online]. Available:

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Load Balancing Mechanisms in Data Center Networks

Load Balancing Mechanisms in Data Center Networks Load Balancing Mechanisms in Data Center Networks Santosh Mahapatra Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 33 {mahapatr,xyuan}@cs.fsu.edu Abstract We consider

More information

Advanced Computer Networks. Scheduling

Advanced Computer Networks. Scheduling Oriana Riva, Department of Computer Science ETH Zürich Advanced Computer Networks 263-3501-00 Scheduling Patrick Stuedi, Qin Yin and Timothy Roscoe Spring Semester 2015 Outline Last time Load balancing

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Pepper: An Elastic Web Server Farm for Cloud based on Hadoop. Subramaniam Krishnan, Jean Christophe Counio Yahoo! Inc. MAPRED 1 st December 2010
