Tech Report TR-WP Analyzing Virtualized Datacenter Hadoop Deployments Version 1.0
Longitudinal Analytics of Web Archive data
European Commission Seventh Framework Programme
Call: FP7-ICT, Activity: ICT, Contract No:
Tech Report TR-WP: Analyzing Virtualized Datacenter Hadoop Deployments, Version 1.0
Editor: Aviad Pines
Work Package: WP3
Status: Final
Date:
Dissemination Level: Public
Tech Report TR-WP: Analyzing Virtualized Datacenter Hadoop Deployments

Project Overview
Project Name: LAWA (Longitudinal Analytics of Web Archive data)
Call Identifier: FP7-ICT
Activity Code: ICT
Contract No:
Partners:
1. Coordinator: Max-Planck-Institut für Informatik (MPG), Germany
2. Hebrew University of Jerusalem (HUJI), Israel
3. European Archive Foundation (EA), Netherlands
4. Hungarian Academy of Sciences (MTA-SZTAKI), Hungary
5. Hanzo Archives Limited (HANZO), United Kingdom
6. University of Patras (UP), Greece

Document Control
Title: Analyzing Virtualized Datacenter Hadoop Deployments
Author/Editor: Aviad Pines
Document History (Version, Date, Author/editor, Description/comments): Aviad Pines, Master thesis
Contents
Abstract ... 4
Introduction ... 4
Related Work and Anticipated Impact ... 5
Methodology ... 6
Amazon Deployments ... 7
Elastichosts Deployments ... 9
A Mathematical Model of Computation Time ... 11
Map Phase Tipping Point ... 17
Summary ... 19
Acknowledgements ... 19
References ... 20
Abstract

This paper discusses the performance of Hadoop deployments on virtualized data centers such as Amazon EC2 and Elastichosts, both when the Hadoop cluster is located in a single data center and when it is spread in a cross-datacenter deployment. We analyze the impact of bandwidth between nodes on cluster performance.

Introduction

MapReduce is a programming model for processing and generating large data sets. It is in common use by the commercial sector, as well as by researchers, to process large quantities of data; among its users are Amazon, Facebook and Yahoo [1] [2]. The typical Hadoop deployment takes place in a server room or, for large deployments, in a data center. In this paper we aim to analyze the performance of Hadoop deployments of various sizes in the virtualized data centers provided by two cloud server providers: Amazon [3] and Elastichosts [4]. In addition to investigating Hadoop performance in the cloud, we have deployed Hadoop between two of Elastichosts' data centers and investigated Hadoop performance across the Internet. It is well known that available bandwidth differs between the server, the local rack and the cluster as a whole [5], and we initially expected an order-of-magnitude drop in bandwidth between a single-datacenter and an inter-datacenter Hadoop deployment (both because of this pattern of an order-of-magnitude drop with distance and because of our experience with Internet speeds). We also expected that Hadoop would have difficulty coping with the larger latency introduced by an Internet link. While we did observe the expected drop in bandwidth, we were surprised to discover that Hadoop does indeed run across the open Internet (at least between data centers).
The ability to run Hadoop deployments over multiple datacenters has an obvious advantage: companies can use several of their datacenters to run large jobs without being confined to a single location, thus reducing the time it takes to complete a job. In addition, data collection can occur at separate locations across a company's geography; enabling computation across these distinct locations avoids the need to move the data, which would otherwise cost the company money for extra storage. After analyzing both deployments (Amazon and Elastichosts) we used the data to create a mathematical model of a Hadoop job's execution time. We used this model to investigate the effects of various variables on the job execution time.

Related Work and Anticipated Impact

Previous work has shown that MapReduce offers a flexible and effective tool for data processing, proving to be efficient, fault tolerant and highly scalable [6] [2]. It should be noted that Hadoop's implementation favors a homogeneous cluster structure, and deploying it otherwise may impede performance; work has therefore been carried out on improving performance in scenarios where the cluster is heterogeneous [7]. Performance studies of Hadoop have also been conducted, in both regular and virtualized environments [8] [9], but they mostly focus on total job running times and efficient scheduling of jobs and tasks, and ignore the underlying network performance as uninteresting. A more recent work studies the effects of network traffic in the confines of a small, 5-node Hadoop cluster [10]. Aside from Hadoop performance, schedulers and protocols have been designed to improve networking performance in data centers, and studies have been carried out to characterize the network traffic of large data centers [11] [12] [13]. Additionally, work has been done to investigate the advantages of the shift from a monolithic, large, single-site datacenter to smaller, more distributed clouds [14]. It is our aim to expand upon this base of knowledge by contributing data on the network traffic generated by Hadoop and insight into how network performance potentially affects Hadoop performance. We also aim to show that deploying Hadoop across the Internet is a real possibility.
Methodology

Hadoop itself provides various metrics about its performance: each task logs its start and end times, the amount of data it consumed and produced, and the amount of time that it ran. This is useful but insufficient, since the bandwidth and idle-time statistics of the network are not measured. In order to obtain more detailed metrics we implemented a Statistics Server, an additional Hadoop component that serves as a sink for traffic reports. Traffic reports are small descriptors sent from each node in the cluster which describe network transfers by their starting timestamp, duration, size, and type (HDFS reads, reducer inputs or HDFS writes). Using these traffic reports we were able to measure the bandwidth at various parts of the job, the typical size of a transfer, and the total time the network was idle of Hadoop traffic. Care was taken to reduce the volume of this reporting traffic so as not to affect the actual running time of Hadoop jobs on the modified cluster. The Statistics Server does not give us visibility into management traffic, e.g. traffic sent from the Name Node in order to control the execution flow of the job. However, the management traffic is negligible compared to the rest of the traffic of a Hadoop job and therefore has no effect on cluster performance. Our job of choice was a simple word count benchmark run on the Gutenberg dataset, a collection of about 23,000 free books in plain text format (totaling about 10GB of text) available from the Gutenberg Project [15]. Our virtualized clusters were of two types: the Amazon EC2 cluster used m1.large instances, which provide 2 computational cores and 7.5GB of RAM; on Elastichosts we specified machines with a single core, 2GB of RAM and a clock setting of 2,000 MHz.
In both cases, however, we do not really know which virtualization setup our cluster actually runs on, the load of other VMs on the physical host, or the general networking load in the data center. This is the nature of the cloud environment. For our cluster configuration, we have specified that each machine will have a number of map and reduce slots identical to the number of virtualized cores that it is configured with.
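The traffic reports described above are small, fixed-shape records, and the derived metrics (mean bandwidth, network idle time) are straightforward aggregations over them. A minimal sketch of how such a sink might compute these metrics (the record fields and function names are our own illustration, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class TrafficReport:
    """One network transfer, as reported by a cluster node."""
    src: str          # reporting node
    dst: str          # transfer destination
    start_ms: int     # starting timestamp (milliseconds)
    duration_ms: int  # transfer duration (milliseconds)
    size_bytes: int   # bytes moved
    kind: str         # "hdfs_read", "reducer_input" or "hdfs_write"

def mean_bandwidth_kb_s(reports):
    """Mean bandwidth over all reported transfers, in KB/sec."""
    usable = [r for r in reports if r.src != r.dst]  # drop same-machine transfers
    total_bytes = sum(r.size_bytes for r in usable)
    total_secs = sum(r.duration_ms for r in usable) / 1000.0
    return (total_bytes / 1024.0) / total_secs

def idle_fraction(reports, job_start_ms, job_end_ms):
    """Fraction of the job's lifetime with no Hadoop traffic on the wire."""
    busy = sorted((r.start_ms, r.start_ms + r.duration_ms)
                  for r in reports if r.src != r.dst)
    merged_busy, cur_start, cur_end = 0, None, None
    for s, e in busy:  # merge overlapping transfer intervals
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                merged_busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        merged_busy += cur_end - cur_start
    return 1.0 - merged_busy / (job_end_ms - job_start_ms)
```

Note that same-machine transfers are filtered out before aggregation, for the reason given in the next section: they never touch the datacenter network.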
When calculating our networking metrics we have taken care to omit any transfer whose source and destination are on the same machine. Such transfers are handled by the OS and hypervisor protocol stacks and do not use any actual networking resources in the data center itself; as such, they neither add to the network load nor are affected by the link speed.

Amazon Deployments

First, we deployed our modified Hadoop onto Amazon EC2. We used plain EC2 VMs rather than Elastic MapReduce, since we wanted to manage our machines ourselves and to use a custom Hadoop version that included our support for statistics gathering. (Elastic MapReduce allows users to execute map-reduce jobs without tinkering with cluster configuration or the management of individual hosts.) The cluster size ranged from 8 to 128 machines with two cores each. Overall, we wanted to see whether network performance affects the running time of Hadoop jobs and whether running Hadoop over the Internet is possible. First, we wanted to see whether the number of machines involved in the computation would affect the performance of the data center network. The following graph illustrates the effect that the Hadoop cluster size has on the average bandwidth of a network path between two machines in the virtualized datacenter.
Figure 1: Mean bandwidth per number of hosts on Amazon EC2

Figure 2: Running time per number of cores on Amazon EC2
Figure 3: Cumulative transfers for the Gutenberg job on Amazon EC2, 64 hosts (128 cores)

It is rather interesting to note that there are very few transfers of map inputs; indeed, the number of non-local map tasks is extremely small. This is due to the uniform distribution of input blocks across machines (giving each machine the same amount of local input) and the homogeneity of machines across the Hadoop cluster (giving each machine similar processing power to crunch through its local maps). The bulk of transfers occur in the shuffle phase, where map outputs are transferred to their reducers.

Elastichosts Deployments

Our second set of experiments centered on benchmarking Hadoop in an inter-datacenter deployment. We deployed clusters of 4 and 16 slaves across two Elastichosts datacenters: one located in San Antonio, Texas, the other in Los Angeles, California. In both cases, half the machines were located in each of the data centers. This means that in the 4-machine job approximately 66% of the traffic was sent through the cross-datacenter link, while in the 16-machine job approximately 53% of the traffic was sent through it. In both cases, more than half of the traffic was cross-datacenter.
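These traffic fractions follow directly from the shuffle's all-to-all pattern: with the cluster split evenly between two sites, each node's shuffle output goes to every other node, so roughly (remote peers)/(all peers) of it crosses the inter-datacenter link. A quick back-of-the-envelope check (not part of the report's tooling):

```python
def cross_dc_fraction(n_machines: int) -> float:
    """Share of a node's shuffle traffic that crosses the inter-datacenter
    link, assuming a uniform all-to-all shuffle, an even split of machines
    between the two sites, and no transfers from a node to itself."""
    peers = n_machines - 1           # every other node receives an equal share
    remote_peers = n_machines // 2   # the half located in the other datacenter
    return remote_peers / peers
```

For 4 machines this gives 2/3 (about 66%), and for 16 machines 8/15 (about 53%), matching the figures above.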
Figure 4: In the 4-machine deployment, hadoop1 sends on the intra-datacenter link to hadoop2, and on the inter-datacenter link to hadoop3 and hadoop4, meaning that approximately 66% of its traffic is sent through the inter-datacenter link.

The RTT for packets between the two data centers averaged about 45ms, measured from 5,000 pings sent between the data centers. As expected, there is a significant difference between the bandwidths of the two links: an order of magnitude difference was measured between machines in different data centers and machines located in the same one. However, the following graph shows that there is no actual effect on the running time.
Figure 5: Running time per number of cores on Amazon EC2 and Elastichosts

This is a non-trivial result: the fact that the running times are similar in the 16-core case (with Elastichosts being slightly faster) is not what we expected to see. Network performance between two hosts in different data centers is comparable to the download speed of a consumer Internet connection, and we expected this to affect computation time, but this was not the case. Upon examining the actual busy times of the network paths between cluster nodes, we discovered that network transfers occurred, on average, during only 3% of the computation time on Amazon and 6.3% of the computation time on Elastichosts. This leaves the network idle for the vast majority of the job's runtime and explains the result. This led us to devise a mathematical model of the computation time of a Hadoop job that would explain this and enable us to explore the effects of bandwidth on computation time.

A Mathematical Model of Computation Time

Our mathematical model works as follows. We separate the model into two parts: the mapping and shuffle phase, and the reduce phase.
The mapping time falls into one of two scenarios: either the map tasks take longer than the shuffle phase, or vice versa. Assuming the former, the mapping time is the time to complete all the map tasks plus the time of a single shuffle for the last map output. Assuming instead that the shuffle phase takes longer than the map phase, the time is the first map task's time (since the actual copying of data in the shuffle only starts after it), plus the time it takes to shuffle the outputs of all the map tasks, in parallel, to the reducers. We take the mapping time as the maximum between the two cases. Let us analyze the first case, where the map tasks take longer than the shuffle. Looking at the mapping phase, and assuming equal distribution of input between all the mappers, completing the map tasks takes (M/N)·m, where m is the average map execution time, M is the total number of map tasks and N is the number of cores. Now let us look at a single shuffle time. It is the average output of the mapper divided by the number of hosts (all the transfers are done in parallel), divided by the speed of the network link. Since the cross-datacenter transfer will always be the bottleneck in terms of network speed, we use it as the worst-case scenario. Denoting O_m as the average output per mapper and l_c as the cross-datacenter link speed, we get O_m/(N·l_c) for a single shuffle time, so the first case of the map phase takes (M/N)·m + O_m/(N·l_c). Proceeding to the second case, where the shuffle phase takes longer than the map tasks, we have the first map task's time plus the total shuffle time, which is the single shuffle time calculated above multiplied by the number of map tasks per core that must be shuffled to the reducers. We get the following for the second case of the map phase:

m + (O_m/(N·l_c))·(M/N)    (1)

And the map phase will be the maximum between the two cases:

max( (M/N)·m + O_m/(N·l_c), m + (O_m/(N·l_c))·(M/N) )    (2)
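Using the notation above (M map tasks, N cores, average map time m, average mapper output O_m, cross-datacenter link speed l_c), equation (2) can be evaluated directly. A minimal sketch (the unit choices are ours, not the report's):

```python
def single_shuffle_time(o_m, n, l_c):
    """Time to shuffle one mapper's output: O_m / (N * l_c).
    Here o_m is in bytes and l_c in bytes/sec, so the result is in seconds."""
    return o_m / n / l_c

def map_phase_time(big_m, n, m, o_m, l_c):
    """Equation (2): the map phase is the slower of the two cases,
    maps-dominate vs. shuffle-dominates."""
    maps_dominate = (big_m / n) * m + single_shuffle_time(o_m, n, l_c)
    shuffle_dominates = m + single_shuffle_time(o_m, n, l_c) * (big_m / n)
    return max(maps_dominate, shuffle_dominates)
```

With a fast link the first term dominates; as l_c shrinks or O_m grows, the second term takes over, which is exactly the tipping point examined later in this report.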
The reduce phase is the number of reducers we have per host times the reducer execution time. Denoting R as the total number of reduce tasks and r_max as the maximum reducer execution time, we get the following for the reduce phase:

(R/N)·r_max    (3)

The reason that we take the maximum reducer execution time and not the average is that, unlike the mapping phase, where we can assume that Hadoop will distribute the map tasks uniformly among the mappers, the reducers accept keys. A single key cannot be distributed among several reducers, making the largest key the bottleneck of the reduce tasks. This makes our complete model of the computation time, where M is the number of map tasks, N the number of slots, m the average map execution time, O_m the average output per mapper, l_c the cross-datacenter link speed, R the total number of reduce tasks and r_max the maximum reducer execution time:

max( (M/N)·m + O_m/(N·l_c), m + (O_m/(N·l_c))·(M/N) ) + (R/N)·r_max    (4)

How accurate is our model? We used the Hadoop metrics to extract the inputs to our equation from some of the jobs we executed. The results are illustrated in the following table:

Figure 6: Comparing mathematical model results with actual running times (columns: total map/reduce slots; estimated running time in minutes; actual running time in minutes; error; rows cover both Elastichosts and Amazon clusters)
Figure 7: Data used to estimate the mathematical model results on Amazon EC2 (columns: number of cores; average map execution time in milliseconds; average mapper output in bytes; link speed in K/sec; total number of reduce tasks; total number of map tasks; max reducer execution time in milliseconds)

Figure 8: Data used to estimate the mathematical model results on Elastichosts (same columns as Figure 7)

As we can see, our model, while not perfect, does follow the general behavior of the Hadoop clusters. This means that we can explore the effects of the input variables on the (estimated) running times. The most significant difference between the Amazon cluster and the Elastichosts one is that the latter has a cross-datacenter network link which is significantly slower than the intra-datacenter one. Amazon's bandwidth ranges from 4000 to 7000 kb/sec. Elastichosts' intra-datacenter bandwidth ranges from 6000 to 7000 kb/sec, while the cross-datacenter bandwidth is an order of magnitude slower: it ranges from 560 to 810 kb/sec.

Figure 9: Comparison of mean bandwidth between Amazon EC2 and Elastichosts (columns: total map/reduce slots; Amazon mean bandwidth; Elastichosts intra-datacenter bandwidth; Elastichosts cross-datacenter bandwidth)

In our formula we choose the slowest link, since it is the one creating the network bottleneck. We tested different parameters to see how the link bandwidth affects the overall job length.

Figure 10: Using the mathematical model to compare the effects of the link bandwidth on the job length (columns: link bandwidth in K/sec; job length in milliseconds; job length in hours)

We can see that the last significant effect was between the 6k/sec and the 7k/sec settings; beyond that, the changes amount to only a few seconds of improvement. It seems that the link bandwidth is hardly significant for the job length: if you have more than a 7k/sec connection (which is considered slow today even for an average home connection, let alone datacenter link
speeds), upgrading the link bandwidth will not help speed up the job. This finding is not surprising when taking into consideration the time the network was idle, which, as mentioned before, was 97% of the job's lifetime. Instead of changing the network link, we can increase the mapper output; this increases the amount of data we need to transfer and is equivalent to changing the network speed. In our jobs the mapper output was bytes. We increased the mapper output in our model to see how it would affect the job length.

Figure 11: Using the mathematical model to see the effects of mapper output on job length (columns: mapper output size in bytes; job length in milliseconds; job length in hours)

We can see that increasing the mapper output size by seven orders of magnitude increases the job length by less than a second, and only an increase of eight orders of magnitude produces a difference measured in more than a couple of minutes. This means that even a job that creates a vast amount of data will not see its overall length affected by the transfers alone. In real jobs, however, we expect that increasing the output bytes of the mapper will increase the running time of the reducers, since more input is sent to them. Under the assumption that the increased output has the same distribution as before, we can conclude that each reducer will receive a larger amount of input, proportional to the increased mapper output. Since the running time is linearly proportional to the input data size [7], we can assume that the running time of each reducer will increase by the same factor as the mapper output.

Figure 12: Using the mathematical model to see the effects of mapper output and reducer running time on the job length (columns: mapper output size in bytes; reducer running time in milliseconds; job length in milliseconds; job length in hours)

We can see that the increase in reducer time increases the overall job length, and yet increasing the data tenfold increases the overall running time by less than a factor of two.

Map Phase Tipping Point

The next thing we wanted to explore is when the tipping point between the mapper time and the shuffle time in the map phase occurs. Looking again at the mathematical model, the map phase consists of
max( (M/N)·m + O_m/(N·l_c), m + (O_m/(N·l_c))·(M/N) )    (5)

We wanted to explore when that tipping point occurs. We modified our Hadoop job to generate additional output from each mapper, and examined how this affected the running time of the mapper. The original average output bytes for all jobs was

Figure 13: Effect of mapper output on the mapper running time (columns: mapper average output multiplier; mapper average output size in bytes; mapper average length in milliseconds)

We then extrapolated the ratio between the output multiplication factor and the running time increase that would be required to produce it.

Figure 14: Estimating the running time growth as a function of the mapper output size
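The complete model of equation (4) and the tipping-point scan described in this section are easy to evaluate numerically. A sketch, assuming a fitted map-time function; the coefficients used below are illustrative placeholders, not the actual fit from Figure 14:

```python
def job_time_ms(big_m, n, m_ms, o_m, l_c, big_r, r_max_ms):
    """Equation (4): map/shuffle phase plus reduce phase.
    Times are in milliseconds; o_m in bytes; l_c in bytes per millisecond."""
    shuffle_one = o_m / n / l_c                      # single shuffle time
    map_phase = max((big_m / n) * m_ms + shuffle_one,
                    m_ms + shuffle_one * (big_m / n))
    reduce_phase = (big_r / n) * r_max_ms
    return map_phase + reduce_phase

def find_tipping_point(big_m, n, o_m, l_c, map_time_of, max_i=10**6):
    """Scan output multiplication factors i, returning the first i where the
    shuffle-dominated case of equation (5) overtakes the map-dominated case,
    or None if no tipping point occurs in range."""
    for i in range(1, max_i + 1):
        m_ms = map_time_of(i)                 # fitted map time at factor i
        shuffle_one = (o_m * i) / n / l_c     # shuffle grows with the output
        if m_ms + shuffle_one * (big_m / n) > (big_m / n) * m_ms + shuffle_one:
            return i
    return None

# Illustrative only: a hypothetical linear fit of map time against factor i.
fitted_map_time = lambda i: 30_000 + 50 * i
```

Whether a tipping point appears depends on the fitted coefficients and the link speed; for the word-count job, the report's scan finds none up to a factor of 1,000,000.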
The map average time was fitted as a function of i, where i is the multiplication factor of the mapper average output size. We then input this function into our mathematical model and iterated over different multiplication factors in order to find whether such a tipping point exists and, if so, where it occurs. We iterated from 1 (no multiplication) to 1,000,000 and no such tipping point occurred. This brought us to the conclusion that for jobs of this nature, such as word count, there is no tipping point, i.e. in the map phase the mapper average time will always be larger than the shuffle time.

Summary

In this paper we aimed to investigate the deployment of a Hadoop cluster in a virtualized data center environment. We began by developing a tool that augments Hadoop's counter metric gathering system in order to gather additional statistics that Hadoop itself does not collect. Using this system, we were able to evaluate the network performance of a Hadoop deployment on Amazon and gain insight into the amount of traffic Hadoop generates and the network resources provided by Amazon EC2. Following this first step, we discovered that Hadoop's network utilization is low and decided to try to deploy Hadoop between two data centers connected by the public Internet. This deployment took place on Elastichosts, and we can now conclude that unmodified Hadoop code can indeed be deployed in such a cluster. With the measurements gained from both deployments, we then created a mathematical model which allows for the approximation of a Hadoop job's runtime.
Using this model, we were able to conclude that Hadoop jobs are typically not network constrained: even a very large (for example, an order of magnitude) increase in mapper output, which is the most significant source of data transfers in a Hadoop job, does not affect the running time significantly (or if it does, the increase of the reducer and mapper running times is still the dominating factor). This is likely our most interesting result, and it opens the way to additional research into Hadoop deployments across the Internet and other research testbeds.

Acknowledgements
This work has been supported by the LAWA project, an EC collaborative research project (number ) on Longitudinal Analytics of Web Archive Data, which is part of the FIRE ("Future Internet Research and Experimentation") portfolio of ICT research supported by the EC.

References

[1] The Apache Software Foundation, "Who Uses Apache Hadoop," [Online]. Available:
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, 2008.
[3] Amazon, "Amazon Elastic Compute Cloud (Amazon EC2)," [Online]. Available:
[4] ElasticHosts, "ElasticHosts Cloud Servers," [Online]. Available:
[5] L. A. Barroso and U. Hölzle, The Datacenter as a Computer, Morgan & Claypool.
[6] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Communications of the ACM, vol. 53, no. 1.
[7] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors, A. Manzanares and X. Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters," Department of Computer Science and Software Engineering, Auburn University, Auburn, AL.
[8] H. González-Vélez and M. Kontagora, "Performance evaluation of MapReduce using full virtualisation on a departmental cloud," International Journal of Applied Mathematics and Computer Science, vol. 21, no. 2, June 2011.
[9] K. Kambatla, A. Pathak and H. Pucha, "Towards Optimizing Hadoop Provisioning in the Cloud," in Proc. of the First Workshop on Hot Topics in Cloud Computing.
[10] N. B. Rizvandi, J. Taheri, R. Moraveji and A. Y. Zomaya, "Network Load Analysis and Provisioning of MapReduce Applications."
[11] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel and S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," Communications of the ACM, vol. 54, no. 3.
[12] S. Kandula, S. Sengupta, A. Greenberg, P. Patel and R. Chaiken, "The Nature of Data Center Traffic: Measurements & Analysis," in Proceedings of the 9th ACM SIGCOMM Internet Measurement Conference, New York, NY.
[13] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang and A. Vahdat, "Hedera: Dynamic Flow Scheduling for Data Center Networks," in Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI '10), Berkeley, CA, USA.
[14] K. Church, A. Greenberg and J. Hamilton, "On Delivering Embarrassingly Distributed Cloud Services."
[15] Project Gutenberg, [Online]. Available:
More informationAnalysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationDeploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk
WHITE PAPER Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk 951 SanDisk Drive, Milpitas, CA 95035 2015 SanDisk Corporation. All rights reserved. www.sandisk.com Table of Contents Introduction
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationLCMON Network Traffic Analysis
LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia adamblack@swin.edu.au Abstract The Swinburne
More informationThe Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationEfficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
More informationShareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing
Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Hsin-Wen Wei 1,2, Che-Wei Hsu 2, Tin-Yu Wu 3, Wei-Tsong Lee 1 1 Department of Electrical Engineering, Tamkang University
More informationLeveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000
Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline
More informationHadoop Kelvin An Overview
Hadoop Kelvin An Overview Aviad Pines and Lev Faerman, HUJI LAWA Group Introduction: This document outlines the Hadoop Kelvin monitoring system, and the benefits it brings to a Hadoop cluster. Why Hadoop
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationCiteSeer x in the Cloud
Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar
More informationOptimizing Hadoop Block Placement Policy & Cluster Blocks Distribution
International Journal of Computer, Electrical, Automation, Control and Information Engineering Vol:6, No:1, 212 Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution Nchimbi Edward Pius,
More informationVirtual Machine Based Resource Allocation For Cloud Computing Environment
Virtual Machine Based Resource Allocation For Cloud Computing Environment D.Udaya Sree M.Tech (CSE) Department Of CSE SVCET,Chittoor. Andra Pradesh, India Dr.J.Janet Head of Department Department of CSE
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationCIT 668: System Architecture
CIT 668: System Architecture Data Centers II Topics 1. Containers 2. Data Center Network 3. Reliability 4. Economics Containers 1 Containers Data Center in a shipping container. 4-10X normal data center
More informationGraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
More informationIntroduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationGuidelines for Selecting Hadoop Schedulers based on System Heterogeneity
Noname manuscript No. (will be inserted by the editor) Guidelines for Selecting Hadoop Schedulers based on System Heterogeneity Aysan Rasooli Douglas G. Down Received: date / Accepted: date Abstract Hadoop
More informationMesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)
UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter
More informationGiving life to today s media distribution services
Giving life to today s media distribution services FIA - Future Internet Assembly Athens, 17 March 2014 Presenter: Nikolaos Efthymiopoulos Network architecture & Management Group Copyright University of
More informationOPTIMIZING PERFORMANCE IN AMAZON EC2 INTRODUCTION: LEVERAGING THE PUBLIC CLOUD OPPORTUNITY WITH AMAZON EC2. www.boundary.com
OPTIMIZING PERFORMANCE IN AMAZON EC2 While the business decision to migrate to Amazon public cloud services can be an easy one, tracking and managing performance in these environments isn t so clear cut.
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationThe Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationMapReduce (in the cloud)
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
More informationPaolo Costa costa@imperial.ac.uk
joint work with Ant Rowstron, Austin Donnelly, and Greg O Shea (MSR Cambridge) Hussam Abu-Libdeh, Simon Schubert (Interns) Paolo Costa costa@imperial.ac.uk Paolo Costa CamCube - Rethinking the Data Center
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationExploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing
Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing Zhuoyao Zhang University of Pennsylvania, USA zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs, USA lucy.cherkasova@hp.com
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationDetection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
More informationFinding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics
Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationGeoCloud Project Report USGS/EROS Spatial Data Warehouse Project
GeoCloud Project Report USGS/EROS Spatial Data Warehouse Project Description of Application The Spatial Data Warehouse project at the USGS/EROS distributes services and data in support of The National
More informationScalable Multiple NameNodes Hadoop Cloud Storage System
Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationCloud computing doesn t yet have a
The Case for Cloud Computing Robert L. Grossman University of Illinois at Chicago and Open Data Group To understand clouds and cloud computing, we must first understand the two different types of clouds.
More informationGraph Database Proof of Concept Report
Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment
More informationThe Availability of Commercial Storage Clouds
The Availability of Commercial Storage Clouds Literature Study Introduction to e-science infrastructure 2008-2009 Arjan Borst ccn 0478199 Grid Computing - University of Amsterdam Software Engineer - WireITup
More informationHadoop Cluster Applications
Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday
More informationAKAMAI WHITE PAPER. Delivering Dynamic Web Content in Cloud Computing Applications: HTTP resource download performance modelling
AKAMAI WHITE PAPER Delivering Dynamic Web Content in Cloud Computing Applications: HTTP resource download performance modelling Delivering Dynamic Web Content in Cloud Computing Applications 1 Overview
More informationWhat is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,
More informationOpen Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)
Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationSurvey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
More informationNetworking Architectures for Big-Data Applications
Networking Architectures for Big-Data Applications Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu Eighth Annual Microsoft Research Networking Summit Woodinville,
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationAn improved task assignment scheme for Hadoop running in the clouds
Dai and Bassiouni Journal of Cloud Computing: Advances, Systems and Applications 2013, 2:23 RESEARCH An improved task assignment scheme for Hadoop running in the clouds Wei Dai * and Mostafa Bassiouni
More informationAPPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM
152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented
More informationA Talari Networks White Paper. Turbo Charging WAN Optimization with WAN Virtualization. A Talari White Paper
A Talari Networks White Paper Turbo Charging WAN Optimization with WAN Virtualization A Talari White Paper 2 Introduction WAN Virtualization is revolutionizing Enterprise Wide Area Network (WAN) economics,
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationA Survey of Cloud Computing Guanfeng Octides
A Survey of Cloud Computing Guanfeng Nov 7, 2010 Abstract The principal service provided by cloud computing is that underlying infrastructure, which often consists of compute resources like storage, processors,
More informationMatchmaking: A New MapReduce Scheduling Technique
Matchmaking: A New MapReduce Scheduling Technique Chen He Ying Lu David Swanson Department of Computer Science and Engineering University of Nebraska-Lincoln Lincoln, U.S. {che,ylu,dswanson}@cse.unl.edu
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationSee Spot Run: Using Spot Instances for MapReduce Workflows
See Spot Run: Using Spot Instances for MapReduce Workflows Navraj Chohan Claris Castillo Mike Spreitzer Malgorzata Steinder Asser Tantawi Chandra Krintz IBM Watson Research Hawthorne, New York Computer
More informationComputing Load Aware and Long-View Load Balancing for Cluster Storage Systems
215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationIndex Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.
Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationProceedings of the Federated Conference on Computer Science and Information Systems pp. 737 741
Proceedings of the Federated Conference on Computer Science and Information Systems pp. 737 741 ISBN 978-83-60810-22-4 DCFMS: A Chunk-Based Distributed File System for Supporting Multimedia Communication
More informationCURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING
Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila
More informationHadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationLifetime Management of Cache Memory using Hadoop Snehal Deshmukh 1 Computer, PGMCOE, Wagholi, Pune, India
Volume 3, Issue 1, January 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com ISSN:
More informationBig Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More information