Pre-Stack Kirchhoff Time Migration on Hadoop and Spark
Chen Yang, Jie Tang, Heng Gao, and Gangshan Wu
State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, China

Abstract. Pre-stack Kirchhoff time migration (PKTM) is one of the most widely used migration algorithms in seismic imaging. However, PKTM takes considerable time due to its high computational cost, which greatly affects the working efficiency of the oil industry. Owing to its high fault tolerance and scalability, Hadoop has become the most popular platform for big data processing. To overcome Hadoop's shortcomings of heavy network traffic and disk I/O, a new distributed framework, Spark, has emerged. However, the behaviour and performance of these two systems when applied to high performance computing are still under investigation. In this paper, we propose two parallel algorithms for pre-stack Kirchhoff time migration, based on Hadoop and Spark respectively. Experiments are carried out to compare their performance. The results show that both implementations are efficient and scalable, and that our PKTM on Spark exhibits better performance than the one on Hadoop.

Keywords: Big Data; Hadoop; Spark; Kirchhoff; MapReduce; RDD

1 Introduction

Pre-stack Kirchhoff time migration (PKTM)[1] is one of the most popular migration techniques in seismic imaging because of its simplicity, efficiency, feasibility and target-oriented property. However, practical PKTM tasks for large 3D surveys are still computationally intensive and usually run on supercomputers or large PC-cluster systems with high purchase and maintenance costs. Nowadays, cloud computing has received a lot of attention from both research and industry due to the deployment and growth of commercial cloud platforms. Compared to other parallel computing solutions such as MPI or GPUs, cloud computing has the advantages of automatically handling failures, hiding parallel programming complexity, and better system scalability. Thus more and more geologists are turning to seismic imaging solutions on Hadoop and Spark.
Hadoop is an Apache open source distributed computing framework for clusters of inexpensive hardware[2] and is widely used for many different classes of data-intensive applications[3]. The core design of Apache Hadoop[4] comprises MapReduce and HDFS. MapReduce is a parallel computing framework that runs on HDFS; it abstracts the user's program into two phases: Map and Reduce. A key benefit of MapReduce is that it automatically handles failures, hiding the complexity of fault tolerance from the programmer[5].

Apache Spark, developed by UC Berkeley's AMP laboratory, is another fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs[6]. Spark is a MapReduce-like cluster computing framework[6-8]. However, unlike Hadoop, Spark keeps distributed data sets in memory, provides interactive queries, and optimizes iterative workloads. Most importantly, Spark introduces the concept of in-memory computing through Resilient Distributed Datasets (RDDs)[3]: data sets can be cached in memory to shorten access latency, which is very efficient for some applications.

In this paper, we propose two parallel algorithms for pre-stack Kirchhoff time migration, based on Hadoop and Spark respectively. Experiments are carried out to compare their performance. The results show that both implementations are efficient and scalable, while PKTM on Spark exhibits better performance than the one on Hadoop.

2 Related Work

Several works in the literature concern implementations of Kirchhoff migration on the MapReduce framework. Rizvandi[9] introduces a PKTM algorithm on MapReduce that splits the data into traces, sends the traces to the Map function for computation, and then shuffles the results to the Reduce function. The program uses the MapReduce framework to parallelize PKTM, but the shuffled data are very large, which seriously degrades the performance of the program.

Another kind of parallel PKTM uses GPUs to achieve parallelism. Shi[1] presents a PKTM algorithm on GPGPUs[10] that uses multiple GPUs to compute the data and runs much faster than a CPU implementation. However, GPU memory is limited; when the data grow too large to be held in GPU memory, the transfer between RAM and GPU memory becomes the bottleneck of the program. Li[11] puts forward another PKTM on GPUs that is 20 times faster than a pure CPU execution, but the data transfer between RAM and GPU memory, as well as the loop control, remain heavy overheads. Generally, running PKTM on GPUs suffers from data transfer and synchronization between the GPUs and the CPU. Gao[12] presents a solution combining GPUs and MapReduce, which can greatly reduce execution time. However, it only maps the Map function to a GPU implementation; if the Map function needs to change, the whole GPU code must be modified, which is not flexible. So far, there is no Kirchhoff work on Spark.
3 Algorithms

Kirchhoff migration uses the Huygens-Fresnel principle to collapse all possible contributions to an image sample. Wherever this sum causes constructive interference, the image is heavily marked; it remains blank or lightly marked where the interference is destructive. A contribution to an image sample T(z,x,y,o) is generated by an input sample S(t,x,y,o) whenever the measured signal travel time t matches the computed travel time from the source to the subsurface point (z,x,y,o) and back to the receiver. The set of all possible input traces (x,y,o) that may contribute to an output trace (x,y,o) lies within an ellipse with axes ax and ay (the apertures) centered at (x,y,o). Input traces are filtered to attenuate spatial aliasing, and sample amplitudes are corrected to account for energy spreading during propagation. The PKTM algorithm and the program data flow are shown in Fig. 1.

Fig. 1: PKTM flowchart. (a) PKTM algorithm; (b) PKTM flowchart.

The schematic of seismic imaging is shown in Fig. 2.

Fig. 2: Seismic imaging
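The matching of measured and computed travel times described above is, in standard pre-stack Kirchhoff time migration, governed by the double-square-root equation; the paper does not state the formula explicitly, so the following is the conventional straight-ray, RMS-velocity form (the symbols are ours, not the paper's):

t = \sqrt{\frac{\tau^2}{4} + \frac{d_s^2}{v_{rms}^2(\tau)}} + \sqrt{\frac{\tau^2}{4} + \frac{d_r^2}{v_{rms}^2(\tau)}}

where \tau is the two-way zero-offset time of the image sample, d_s and d_r are the horizontal distances from the image point to the source and to the receiver, and v_{rms} is the RMS velocity at the image location. An input sample contributes to the image sample whenever its recorded time matches this t.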
The pseudocode of the Kirchhoff algorithm is shown below:

Algorithm 1 Kirchhoff Algorithm
 1: procedure Kirchhoff(inputtraces)
 2:   for all input traces do
 3:     read input trace
 4:     filter input trace
 5:     for all output traces within aperture do
 6:       for all output trace contributed samples do
 7:         compute travel time
 8:         compute amplitude correction
 9:         select input sample and filter
10:         accumulate input sample contribution into output sample
11:       end for
12:     end for
13:   end for
14:   dump output volume
15: end procedure

3.1 PKTM on Hadoop

We propose an algorithm to carry out PKTM on the Hadoop framework. The steps of the algorithm are as follows:

1. Acquiring cluster environment variables: The Hadoop framework is closely tied to the machine configuration, so in this first step some system variables are detected and stored: the number of nodes n, the memory of each node M_1, M_2, ..., M_n, the number of CPUs cpus, the number of cores per CPU cores, and the number of threads per core threads.

2. Inputting data: In Hadoop, each input file split corresponds to a mapper, and according to the Hadoop documentation[4] a node performs well when the number of mappers running on it is between 10 and 100. Therefore, to control the number of mappers, we override the FileInputFormat class, which is in charge of how the input files are logically split; each split produces a mapper. Hadoop normally splits a file into splits of a default size, but the split size is key to system efficiency, so we re-split the whole file into splits of a proper size. First, we read the default split lengths of the whole file, S_1, S_2, S_3, ..., S_n. From these we calculate the total size of the input file as \sum_{i=1}^{n} S_i. The number of reduce tasks r_n is set by the user. Then, to keep the number of mappers in the range above, the size of a split is computed as follows:

f_{split} = \frac{\sum_{i=1}^{n} S_i}{k \cdot \min\left(cpus \cdot cores + r_n,\; \frac{\sum_{i=1}^{n} M_i}{M}\right)}

where k \in [1, \infty) is a controllable parameter, f_{split} must be greater than 0, and M is the memory size of a mapper. Finally, we set mapreduce.input.fileinputformat.split.maxsize to f_{split}, which decides the split size; a sketch of this step is given below.
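Step 2 can be wired into Hadoop roughly as follows. This is a minimal sketch under our own naming, not the paper's code: instead of reproducing the full FileInputFormat subclass, it caps the logical split size through the same mapreduce.input.fileinputformat.split.maxsize property named above, via FileInputFormat.setMaxInputSplitSize.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizing {

    /** Computes f_split from the cluster variables gathered in step 1. */
    static long computeSplitSize(long totalInputBytes, int cpus, int cores,
                                 int reduceTasks, long totalNodeMemBytes,
                                 long mapperMemBytes, double k) {
        long slotsByCpu = (long) cpus * cores + reduceTasks;   // cpus * cores + r_n
        long slotsByMem = totalNodeMemBytes / mapperMemBytes;  // (sum of M_i) / M
        long parallelism = Math.min(slotsByCpu, slotsByMem);
        return Math.max(1L, (long) (totalInputBytes / (k * parallelism))); // f_split > 0
    }

    /** Caps the logical split size, and hence the number of mappers. */
    static void configure(Job job, long fSplit) {
        FileInputFormat.setMaxInputSplitSize(job, fSplit);
    }
}
```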
Through input split processing, we read in the samples of the input traces, producing <key, value> pairs: key represents the offset of an input trace in the input files, and value represents the coordinate points of that input trace. One pair may combine many input traces. These pairs are submitted to the mappers for execution.

3. Mapping: Since the input files are logically divided into E splits, each split is submitted to a mapper and the mappers compute in parallel. Each mapper processes many input traces, and each input trace contributes to many output traces, so to achieve parallelism within a mapper we use multiple threads to process the input traces. The threads share a HashMap that stores output traces under their output keys, so contributions to the same output trace are merged locally, which greatly decreases the data-transmission load. We detect the Hyper-Threading mode of the cluster system: if it is on, we set the number of threads of a mapper to 2 × (threads × 2), otherwise to (threads × 2). Another strategy for decreasing the transmission load is to write the map values to HDFS files and send only <key, filename#offset> pairs. A sketch of the multi-threaded local combining is given below.
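A minimal sketch of such a multi-threaded mapper with local combining, assuming <offset, trace> records; the PKTM kernel, the serialization, and the thread-count choice are placeholders for the paper's versions:

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.*;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class PktmMapper extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {
    // Partial image shared by the worker threads: output-trace key -> summed samples.
    private final ConcurrentMap<Long, float[]> partialImage = new ConcurrentHashMap<>();
    private ExecutorService pool;

    @Override
    protected void setup(Context ctx) {
        // Stand-in for the paper's Hyper-Threading-dependent thread count.
        pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }

    @Override
    protected void map(LongWritable offset, BytesWritable trace, Context ctx) {
        // Copy first: Hadoop reuses the Writable buffer between map() calls.
        final byte[] samples = trace.copyBytes();
        pool.submit(() -> {
            for (Map.Entry<Long, float[]> c : migrate(samples).entrySet()) {
                // Contributions to the same output trace are merged locally.
                partialImage.merge(c.getKey(), c.getValue(), PktmMapper::addSamples);
            }
        });
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        for (Map.Entry<Long, float[]> e : partialImage.entrySet()) {
            ctx.write(new LongWritable(e.getKey()), toBytes(e.getValue())); // one emit per key
        }
    }

    private static float[] addSamples(float[] a, float[] b) {
        for (int i = 0; i < a.length; i++) a[i] += b[i];
        return a;
    }

    /** PKTM kernel placeholder: output-trace contributions of one input trace. */
    private static Map<Long, float[]> migrate(byte[] traceSamples) { return Collections.emptyMap(); }

    /** Serialization placeholder. */
    private static BytesWritable toBytes(float[] samples) { return new BytesWritable(); }
}
```

The variant that writes values to HDFS and emits only <key, filename#offset> pairs would replace the ctx.write call in cleanup.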
4. Combining: In this phase, we aggregate output pairs generated by the mappers on the same node that share the same key, which reduces network traffic across nodes and thus improves the efficiency of the program. Within one mapper and one node there are many output traces with the same key, so this also largely decreases the data-transmission load.

5. Partitioning: The output <key, value> pairs are mapped to reduce nodes by key, ensuring that the keys on each reduce node are ordered according to its reduce task attempt id. The formula is as follows:

f_{hash} = \left\lfloor \frac{key}{key_{max} / r_n} \right\rfloor

In this way, the time to sort the final large image data is reduced: we only need to sort the keys within each reduce task, in parallel, which further reduces the application execution time.

6. Reducing: According to the <key, filename#offset> pairs sent by the map tasks, we read the HDFS files by filename and offset and aggregate the values with the same key coming from different nodes. All the reduce tasks run in parallel. Moreover, the number of reduce tasks is set by the user, who can adjust it to get the best performance.

7. Outputting: Each reduce task produces output <key, value> pairs; we sort the keys within each reduce task and write them to a binary file whose name contains the minimum key of that task. All the sort operations run in parallel.

8. Image Output: We sort the reduce output files by the minimum keys in their names, which takes only a little time, then write them sequentially into a single image binary file. This image file is the final imaging file shown to the professionals.

The program's running architecture is shown in Fig. 3(a) and its flowchart in Fig. 3(b); the steps in the flowchart exactly match the steps of the algorithm above.

Fig. 3: Kirchhoff Implementation on Hadoop. (a) Running on Hadoop; (b) Flowchart of Kirchhoff.

3.2 PKTM on Spark

Spark provides RDDs, which shield many interactions with HDFS and achieve better efficiency for applications with many iterations. Spark also provides an ApplicationMaster that cooperates well with Yarn. Hence we develop a new algorithm with the Spark system on the Yarn framework.

1. Acquiring cluster environment variables: The program on Spark needs to read data from HDFS; we simply read splits from HDFS using newAPIHadoopFile with the default splits, producing <key, value> pairs as the records of an RDD. An RDD can be partitioned according to the user's wishes, and one partition corresponds to an executor, which is similar to a mapper. In addition, Spark provides the spark-submit command to submit an application; users can set the number of task executors N, the memory of each executor and the CPU cores of each executor on the command line, which is very convenient compared to Hadoop. So we just read the environment variables from the command line. The number of partitions of the RDD is then set as:

f_{pars} = k \cdot \min\left(N,\; \frac{\sum_{i=1}^{n} M_i}{M_{min}}\right)

A sketch of this stage, together with the input step that follows, is given below.
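A sketch of this stage in Spark's Java API; the HDFS path, the input format, and the way the totals reach the program are illustrative assumptions, not the paper's code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SparkInput {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("PKTM-Spark"));

        // N, the node memories and the executor memory are set on the spark-submit
        // command line (--num-executors, --executor-memory, --executor-cores);
        // here they simply arrive as program arguments.
        int numExecutors = Integer.parseInt(args[0]);   // N
        long totalNodeMem = Long.parseLong(args[1]);    // sum of M_i
        long minExecutorMem = Long.parseLong(args[2]);  // M_min
        double k = 1.5;                                 // tunable parameter k >= 1

        // f_pars = k * min(N, (sum of M_i) / M_min)
        int fPars = (int) (k * Math.min(numExecutors, totalNodeMem / minExecutorMem));

        // Step 2: read <offset, trace> records from HDFS with whatever default splits.
        JavaPairRDD<LongWritable, BytesWritable> traces = sc.newAPIHadoopFile(
                "hdfs:///pktm/shot.data", SequenceFileInputFormat.class,
                LongWritable.class, BytesWritable.class, new Configuration());

        // Repartition to f_pars and keep the records between stages;
        // MEMORY_AND_DISK spills partitions that do not fit in memory.
        JavaPairRDD<LongWritable, BytesWritable> partitioned =
                traces.repartition(fPars).persist(StorageLevel.MEMORY_AND_DISK());
    }
}
```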
2. Inputting data: Because we run Spark on the Yarn and HDFS frameworks, we need to read files from HDFS. Spark provides several functions to read input files from HDFS and return them in the RDD model. Each RDD contains many records of <key, value> pairs, where key represents the offset of the input trace and value represents the coordinate samples for that key; one record combines many traces. Moreover, an RDD can persist data in memory after the first calculation, so we persist those RDD records in memory, spilling to hard disk if they are too large. This greatly reduces repeated reading from HDFS. We then partition the RDD with the formula above; these partitions are calculated in parallel.

3. FlatMapping: In Spark, RDD partitions are sent to executors. An executor starts to calculate a partition and, after it is done, continues with the remaining partitions; all executors calculate in parallel. In this map period, we also use multiple threads to compute the input traces. Whether the cluster system has Hyper-Threading enabled or not, we set the thread count to (threads × 2) to ensure good performance, because Spark uses a thread model while Hadoop uses a process model. In Hadoop, when the map phase reaches a certain percentage (such as 5%, which can be set in the configuration), reduce tasks are launched to collect output pairs from the mappers; in Spark, reduce tasks wait until all mappers finish and return their RDDs. This feature makes the program on Spark more efficient, as reduce tasks do not occupy the resources used by the mappers. It is worth mentioning that the partitions of an RDD do not change until the repartition function is invoked.

4. Partitioning: First, we obtain the total number of output traces onx. Then we divide the keys according to the number of reduce partitions R_n, so that smaller keys correspond to smaller reduce task ids; this helps the later sort operation. Each record in the RDD applies a mapping operation to choose which reduce partition it goes to. The formula is as follows:

f_{partition} = \left\lfloor \frac{key}{onx / R_n} \right\rfloor

5. ReduceByKey: Spark ensures that pairs with the same key are sent to the same reduce task. Accordingly, the RDD datasets coming from the map operation are sent to the reduce tasks by key, and we aggregate the values with the same key in each reduce task. Each reduce task returns an RDD partition, and these are aggregated into one overall RDD for the user, although the partitions still exist. The important point of the reduceByKey function is that it first merges map output pairs on the same node and only then sends the pairs to the corresponding reduce tasks. This feature greatly reduces the network traffic load.

6. SortByKey: This function sorts the keys within each RDD partition returned by the reduce tasks. The sort is executed in parallel across the different RDD partitions.

7. Image Output: According to the sorted keys, we write the corresponding values into a binary file on HDFS for permanent preservation.

A condensed sketch of this pipeline (steps 3-7) is given below.
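A condensed sketch of steps 3-7 in Spark's Java API (Spark 2.x signatures); migrateTrace and addSamples stand in for the PKTM kernel and the per-key accumulation, and the plain-Long key encoding is our assumption:

```java
import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class SparkPipeline {

    /** Range partitioner implementing f_partition = floor(key / (onx / R_n)). */
    static class TracePartitioner extends Partitioner {
        private final long onx; private final int rn;
        TracePartitioner(long onx, int rn) { this.onx = onx; this.rn = rn; }
        @Override public int numPartitions() { return rn; }
        @Override public int getPartition(Object key) {
            long width = Math.max(1L, onx / rn);              // keys per reduce partition
            return (int) Math.min((Long) key / width, rn - 1);
        }
    }

    static JavaPairRDD<Long, float[]> migrate(JavaPairRDD<Long, byte[]> inputTraces,
                                              long onx, int rn) {
        return inputTraces
            // Step 3: each input trace yields many (outputKey, partialSamples) pairs.
            .flatMapToPair(rec -> migrateTrace(rec))
            // Steps 4-5: route keys to reduce partitions, merging map-side first.
            .reduceByKey(new TracePartitioner(onx, rn), SparkPipeline::addSamples)
            // Step 6: sort the keys of the reduced pairs in parallel.
            .sortByKey();
        // Step 7 would then write the result back to HDFS, e.g.
        // result.saveAsObjectFile("hdfs:///pktm/image");
    }

    /** PKTM kernel placeholder: contributions of one input trace. */
    static Iterator<Tuple2<Long, float[]>> migrateTrace(Tuple2<Long, byte[]> rec) {
        return Collections.emptyIterator();
    }

    /** Accumulates two partial output traces sample by sample. */
    static float[] addSamples(float[] a, float[] b) {
        for (int i = 0; i < a.length; i++) a[i] += b[i];
        return a;
    }
}
```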
The program's running architecture is shown in Fig. 4(a) and its flowchart in Fig. 4(b); the steps of the flowchart correspond to the steps of the algorithm above.

Fig. 4: Kirchhoff Implementation on Spark. (a) Running on Spark; (b) Flowchart of Kirchhoff.

4 Results and analysis

We have run our two implementations on a cluster with six nodes, the details of which are shown in Table 1.

Table 1: Cluster Configuration. Columns: Name, CPUs, Cores per CPU, Threads per Core, Memory (G); rows: Master, Slave1, Slave2, Slave3, Slave4, Slave5.
4.1 Experiment Configuration

The node Master acts as the Hadoop master node and the remaining nodes act as Hadoop slave nodes. Master is not only the master of the HDFS framework (the NameNode) but also the master of the Yarn framework (the ResourceManager). The five nodes Slave1, Slave2, Slave3, Slave4 and Slave5 serve as the DataNodes and NodeManagers.

4.2 Data Preparation

We use the Sigsbee2 models to test our proposed PKTM methods on Hadoop and Spark. The model can be downloaded from RSF/book/data/sigsbee/paper html/. PKTM includes three main steps: data preprocessing, migration and output. PKTM uses two input file formats, meta files and data files[13]. The input files of this program comprise the input trace meta file (shot.meta), the seismic source meta file (fsxy.meta), the detector location meta file (fgxy.meta), the center points meta file (fcxy.meta) and the velocity meta file (rmsv.meta). Each meta file has a corresponding data file (*.data), such as shot.data (about 30 GB), fsxy.data, fgxy.data, fcxy.data and rmsv.data.

4.3 Experimental Results

We evaluate our programs from the following aspects:

(1). Experiments on Hadoop:

a. We first test how the memory of a mapper's container affects the performance of the program. The results are shown in Fig. 5(a). It can be seen that when the consumed memory exceeds a threshold, the number of mappers drops accordingly, which makes the execution time longer. The reason is that the number of active parallel tasks is constrained by the total memory available to the mappers.

b. Second, we test how the number of mappers affects performance. The results are shown in Fig. 5(b). The execution time is smallest when the number of mappers fits the cluster's resources. If the number of mappers is too small, the capacity of the cluster is not fully utilized, resulting in a longer execution time; if it is too large, the scheduling time increases, which also lengthens execution.

c. Finally, in Hadoop the cluster starts reduce tasks to receive map output files even while some mappers are still running. In this situation the started reduce tasks occupy memory and CPU resources, which affects the execution of the remaining mappers. Therefore, we test how the number of reduce tasks affects performance.
The results are shown in Fig. 5(c). It can be seen that the best performance is achieved when the number of reduce tasks matches the number of mappers and their output pairs.

Fig. 5: Kirchhoff Experiment on Hadoop. (a) Map task memory; (b) Number of mappers; (c) Number of reduce tasks.

(2). Experiments on Spark: here the contributions of two factors, the memory of each executor (container) and the number of RDD partitions, are investigated.

a. We adjust the configuration of the container's memory and evaluate the performance of the system. The results are shown in Fig. 6(a). As in Hadoop, if the memory of a container exceeds a proper value, the execution time grows longer, because large containers reduce the number of parallel tasks, and with fewer parallel tasks the total execution time increases.

b. In this part, we vary the number of RDD partitions of the input traces, hoping to find the best partitioning. The results are shown in Fig. 6(b). The figure indicates that when the number of RDD partitions is too small for the cluster, execution takes longer; as the number of partitions grows, the execution time decreases and then stabilizes. The reason for this phenomenon is that once there are more RDD partitions than the cluster resources can serve, the cluster simply stays busy processing the RDDs in the same total amount of time.
Fig. 6: Kirchhoff Experiment on Spark. (a) Container memory; (b) RDD partitions.

(3). Comparison between Hadoop and Spark: we compare the two algorithms from three aspects.

a. First, we compare the Yarn container memory between Hadoop and Spark. On Yarn, the tasks of Hadoop and Spark jobs are started in containers, one task per container. The comparison is shown in Fig. 7(a). With the same container memory, Spark shows better performance, because Spark uses RDDs to read the input traces and persists the data in memory, so the computation on the RDDs happens in memory and runs fast.

b. Second, we compare the reading and writing times of Hadoop and Spark for the same I/O volume. The results are shown in Fig. 7(b). When accessing the same volume of I/O data, the Spark algorithm runs faster than Hadoop's; over the same period, Spark's I/O throughput is higher than Hadoop's.

c. Lastly, we compare the running time of the Kirchhoff application on the Hadoop and Spark frameworks. The results are shown in Fig. 7(c). With the RDD mechanism and the optimizations of the algorithm, PKTM on Spark achieves better efficiency than PKTM on Hadoop.
Fig. 7: Kirchhoff Hadoop vs Spark. (a) Container memory comparison; (b) I/O volume versus time; (c) Running time comparison.

5 Conclusion

In this paper, we proposed two parallel algorithms for pre-stack Kirchhoff time migration based on Hadoop and Spark respectively. The results show that both implementations are efficient and scalable, and that PKTM on Spark exhibits better performance than the one on Hadoop. Future work includes improving the data-transfer speed in both Hadoop and Spark, since the efficiency of the two programs is closely related to the speed of data preparation: with higher network throughput and faster hard-disk I/O, the programs will run faster. In Hadoop, we can apply RDMA (Remote Direct Memory Access) to HDFS over InfiniBand[14], which will greatly accelerate HDFS reads and writes. We also hope to apply InfiniBand network interface cards to accelerate network transfer in Spark.

Acknowledgments. We would like to thank the anonymous reviewers for helping us refine this paper; their constructive comments and suggestions were very helpful. This paper is partly funded by the National Science and Technology Major Project of the Ministry of Science and Technology of China under grant 2011ZX HZ. The corresponding author of this paper is Tang Jie.
References

1. Shi X, Wang X, Zhao C, et al.: Practical pre-stack Kirchhoff time migration of seismic processing on general purpose GPU[C]//2009 World Congress on Computer Science and Information Engineering. IEEE, 2009.
2. Cheng XueQi, Jin XiaoLong, Wang YuanZhuo, et al.: Summary of the Big Data systems and analysis techniques (in Chinese)[J]. Journal of Software, 2014, 25(9).
3. Gu L, Li H.: Memory or time: Performance evaluation for iterative operation on Hadoop and Spark[C]//2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC). IEEE, 2013.
4. Apache Hadoop. [Online]. Available:
5. Zaharia M, Konwinski A, Joseph A D, et al.: Improving MapReduce performance in heterogeneous environments[C]//OSDI. 2008, 8(4).
6. Apache Spark. [Online]. Available:
7. Zaharia M, Chowdhury M, Franklin M J, et al.: Spark: cluster computing with working sets[C]//Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. 2010, 10.
8. Zaharia M, Chowdhury M, Das T, et al.: Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
9. Rizvandi N B, Boloori A J, Kamyabpour N, et al.: MapReduce implementation of prestack Kirchhoff time migration (PKTM) on seismic data[C]//2011 International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT). IEEE, 2011.
10. De Verdiere G C.: Introduction to GPGPU, a hardware and software background[J]. Comptes Rendus Mécanique, 2011, 339(2).
11. Panetta J, Teixeira T, de Souza Filho P R P, et al.: Accelerating Kirchhoff migration by CPU and GPU cooperation[C]//2009 21st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2009.
12. Gao H, Tang J, Wu G.: A MapReduce computing framework based on GPU cluster[C]//2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC). IEEE, 2013.
13. Wang Gang, Tang Jie, Wu GangShan.: GPU-based cluster framework (in Chinese)[J]. Computer Science and Development, 2014, 24(1).
14. Lu X, Islam N S, Wasi-ur-Rahman M, et al.: High-performance design of Hadoop RPC with RDMA over InfiniBand[C]//2013 42nd International Conference on Parallel Processing (ICPP). IEEE, 2013.