Hadoop on a Low-Budget General Purpose HPC Cluster in Academia

Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo
Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, Torino (Italy)

Abstract. In the last decade, we have witnessed an increasing interest in High Performance Computing (HPC) infrastructures, which play an important role in both academic and industrial research projects. At the same time, due to the increasing amount of available data, we have also witnessed the introduction of new frameworks and applications based on the MapReduce paradigm (e.g., Hadoop). Traditional HPC systems are usually designed for CPU- and memory-intensive applications. However, the use of already available HPC infrastructures for data-intensive applications is an interesting topic, in particular in academia, where the budget is usually limited and the same cluster is used by many researchers with different requirements. In this paper, we investigate the integration of Hadoop, and its performance, in an already existing low-budget general purpose HPC cluster characterized by heterogeneous nodes and a small amount of secondary memory per node.

Keywords: HPC, Hadoop, MapReduce, MPI applications

1 Introduction

The amount of available data increases every day. This huge amount of data is a resource that, if properly exploited, provides useful knowledge. However, efficient and powerful systems are needed to extract that knowledge. One possible solution consists in adopting the Hadoop framework [6], which exploits the MapReduce [1] paradigm for the efficient implementation of data-intensive distributed applications. Recent years have also witnessed the increasing availability of general purpose HPC systems [3], such as clusters, commonly installed in many computing centers.
They are usually used to provide different services to communities of users (e.g., academic researchers) with different requirements. These systems are usually designed for CPU- and memory-intensive applications. However, there have been some attempts to integrate Hadoop into general purpose HPC systems as well, in particular in academia. Due to limited budgets, the integration of Hadoop in already available HPC systems is an interesting problem. It allows academic researchers to continue to use their current MPI-based applications and, at the same time, to exploit Hadoop to address new (data-intensive) problems without further costs.

In this paper, we describe the integration of Hadoop inside an academic HPC cluster located at the Politecnico di Torino computing center. This cluster, called CASPER, hosts dozens of scientific software packages, often based on MPI. We decided to integrate Hadoop in CASPER to understand whether it can also manage huge data or whether an upgrade is needed for this new purpose. Our main goal is to continue providing the already available services, based on MPI applications, together with the new ones based on Hadoop, using the same system.

2 (www.hpc.polito.it) aims at providing computational resources and technical support to both academic research and university teaching. To pursue these goals, the computing center has set up a heterogeneous cluster called CASPER (Cluster Appliance for Scientific Parallel Execution and Rendering) with a peak performance of 1.2 TFLOPS. The initiative counts 25 hosted projects and 12 papers developed thanks to our HPC and published by groups operating in different research areas. A detailed technical description of the system is available in previous papers [2] [5]. In our vision, CASPER is a continuously evolving system, developed in collaboration with several research groups and renewed regularly.

2.1 Cluster configuration

CASPER is a standard MIMD distributed shared memory InfiniBand heterogeneous cluster. It has 1 master node and 9 computational nodes, 136 cores, and 632 GB of main memory. As Figure 1 shows, the system is composed of three different types of computational nodes, which have been added to the cluster in subsequent stages according to the needs of the research groups. The cluster is evolving into a massively parallel, large memory system, with average CPU speed and many cores per node. The nodes have very small, low-performance local hard disks, which were intended to contain only the operating system; experimental data are kept on a central NAS.
CASPER is normally used through the scheduler/resource manager SGE (now OGE) for running custom code or third-party software, often taking advantage of MPI libraries (e.g., Esys-Particle, Matlab, OpenFOAM, Quantum-Espresso). The cluster configuration is therefore a compromise that provides sufficient performance for all of the cited software packages. However, the current applications are rarely data-intensive and do not use huge data.

2.2 Hadoop Deployment

Apache Hadoop was installed trying to reconcile its needs with those of our specific system. CASPER is an installation of Rocks Cluster

Fig. 1. CASPER cluster configuration, early

To install Hadoop on it, we used the packages from the EPEL repository for CentOS, from which the Rocks Cluster distribution derives. We decided to use the master node of the cluster also as the master node of Hadoop, while the computational nodes were configured as slave nodes. HDFS was configured with permissions disabled, the data block size set to 256MB, and the number of replicas per data block set to 2. On five nodes (nodes 0-0 to 0-4), we added 1TB local hard disks for the exclusive use of HDFS. On the remaining four nodes, data was stored in /state/partition1, a partition created by the Rocks node installer that corresponds to all the remaining space on the local disk (the local disk size is 160GB for these four nodes).

We configured Hadoop by taking into consideration the heterogeneous nature of CASPER. More specifically, for each node the maximum number of mappers was set equal to the number of cores, while the maximum number of reducers was set to 4 for the five nodes with a large and efficient 1TB local disk, and to 0 for the other four nodes because of the low efficiency and small size of their hard disks.

The package provided by Oracle to integrate Hadoop into our version of SGE is incompatible with the current version of Hadoop. Hence, we decided to use an integration approach based on the creation of a dedicated queue and a new Parallel Environment in the SGE scheduler, so that the Hadoop tasks are still controlled by Hadoop but are managed through the queuing system of the cluster.

3 Experimental results

As described in the introduction, some future research applications require us to use CASPER on large/huge data. However, on the one hand, we cannot buy a new dedicated cluster. On the other hand, we cannot dismiss the software already hosted on CASPER. Hence, we performed an initial set of experiments to understand the scalability, in terms of data size, of an MPI application.
These experiments allowed us to identify the data size limit of MPI programs on our cluster. Then, we performed a set of experiments based on

Hadoop (i) to evaluate the scalability of Hadoop on CASPER and (ii) to understand which upgrades are needed to be able to use CASPER on big data.

3.1 Experimental setting. To evaluate the scalability of MPI-based algorithms, we used an MPI implementation of the quicksort algorithm derived from publicly available code [4]. It receives as input a single file containing a set of numbers (one number per line) and generates a single file corresponding to the sorted version of the input. The algorithm, like all typical MPI applications, works exclusively in main memory. One of the benchmark tests usually performed to analyze the efficiency and scalability of Hadoop is the Hadoop-based implementation of Terasort [6]. Hence, we used the same test to evaluate the scalability of Hadoop on our cluster.

3.2 MPI-based sorting algorithm

The scalability of the MPI sorting algorithm was evaluated on files ranging from 21GB to 100GB. Larger files are not manageable due to the memory-based nature of the algorithm. Detailed results are reported in Table 1(a) for three different configurations, characterized by a different number of nodes/cores. We initially considered only nodes 0-8 and 0-9 reported in Figure 1, then nodes 0-6 to 0-9, and finally all nodes. The first configuration is homogeneous but has only 64 cores, while the last one is the most heterogeneous but exploits all the available resources. The results reported in Table 1(a) highlight that our MPI sorting algorithm is not able to process files larger than 100GB. Hence, it cannot manage big data. As expected, the execution time decreases when the number of nodes/cores increases. Figure 2(a) shows that the execution time decreases linearly with respect to the number of cores. However, the slope of the curves depends on the file size. Hence, the availability of more cores is potentially a positive factor.
However, increasing the number of nodes has, in some cases, a negative impact. More specifically, if we use all nodes, the sorting process does not complete for files of 42GB or larger. The problem is caused by the limited RAM of the 5 Intel nodes (24GB per node): they are not able to process the tasks assigned to them by the MPI-based sorting program when the file size is larger than approximately 40GB.

Table 1. Execution time

(a) MPI-based sorting algorithm. DNF = did not finish.

Dataset size   Configuration (#cores / total RAM)   Execution time
21GB           64 cores / 256GB                     39m49s
               96 cores / 512GB                     31m52s
               136 cores / 632GB                    26m30s
42GB           64 cores / 256GB                     1h41m32s
               96 cores / 512GB                     1h24m41s
               136 cores / 632GB                    DNF
100GB          64 cores / 256GB                     3h38m59s
               96 cores / 512GB                     2h52m59s
               136 cores / 632GB                    DNF

(b) Hadoop-based terasort.

Dataset size   Configuration (#cores / total RAM)   Execution time
100GB          40 cores / 120GB                     1h23m2s
               136 cores / 632GB                    57m26s
200GB          40 cores / 120GB                     3h52m57s
               136 cores / 632GB                    3h21m49s

Fig. 2. Execution time (s) versus number of cores, one curve per dataset size: (a) MPI-based sorting algorithm; (b) Hadoop-based terasort.

3.3 Terasort (Hadoop-based application)

Since the MPI application does not allow processing large files on CASPER, we decided to test Hadoop on it. Hadoop is usually used on commodity hardware. However, CASPER has a set of peculiarities (e.g., it is extremely heterogeneous) and hence it might not be suitable for Hadoop. We performed the tests by means of a standard algorithm called Terasort. We decided to evaluate the Hadoop-based implementation of Terasort on two extreme configurations. The first configuration is based only on the 5 Intel nodes (40 cores), while the second one exploits all nodes (136 cores). The first configuration is homogeneous (5 Intel 3.2GHz nodes with 1TB of secondary memory per node). The second one is extremely heterogeneous (different CPU frequencies and local disks with sizes ranging from 160GB to 1TB).

The results reported in Table 1(b) and Figure 2(b) show that the Hadoop-based Terasort algorithm can process a 200GB file in less than 4 hours. Hence, even on CASPER, which is not designed for Hadoop, the use of Hadoop allows processing files larger than those manageable by means of MPI algorithms. However, the first configuration (5 homogeneous nodes) is only slightly slower than the second one (composed of all nodes). The second configuration has 240% more cores than the first one, but the execution time decreases by only 13% when the file size is 200GB (31% when the file size is 100GB).
These results confirm that we need more homogeneous nodes in our cluster and, on average, larger local disks on each computational node if we want to increase the scalability of CASPER for data-intensive applications based on Hadoop. We will consider this important point during the planning of the next upgrade of CASPER.
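The percentages quoted for the two Terasort configurations can be recomputed from the core counts and execution times in Table 1(b). The following is a minimal sketch in Python; the helper names (`to_seconds`, `percent_change`) and the derived speedup and "ideal speedup" figures are illustrative assumptions and do not appear in the original experiments.

```python
# Recompute the Section 3.3 percentages from Table 1(b).
# Helper names are hypothetical; only the core counts and times come from the paper.

def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s execution time into seconds."""
    return hours * 3600 + minutes * 60 + seconds

def percent_change(old, new):
    """Relative change from old to new, in percent (negative means a decrease)."""
    return (new - old) / old * 100.0

CORES_INTEL_ONLY = 40   # 5 Intel nodes
CORES_ALL_NODES = 136   # all 9 computational nodes

# Execution times from Table 1(b): (40-core run, 136-core run)
runs = {
    "100GB": (to_seconds(1, 23, 2), to_seconds(0, 57, 26)),
    "200GB": (to_seconds(3, 52, 57), to_seconds(3, 21, 49)),
}

core_increase = percent_change(CORES_INTEL_ONLY, CORES_ALL_NODES)
time_change = {size: percent_change(t40, t136) for size, (t40, t136) in runs.items()}
speedup = {size: t40 / t136 for size, (t40, t136) in runs.items()}
# Speedup if execution time scaled perfectly with the number of cores.
ideal_speedup = CORES_ALL_NODES / CORES_INTEL_ONLY

if __name__ == "__main__":
    print(f"core increase: {core_increase:+.0f}%")
    for size in runs:
        print(f"{size}: time change {time_change[size]:+.0f}%, "
              f"speedup {speedup[size]:.2f}x (ideal {ideal_speedup:.1f}x)")
```

Rounded to whole percentages, the computed values agree with the figures quoted in the text: 240% more cores, with an execution time reduction of about 31% for the 100GB file and about 13% for the 200GB file. Comparing the measured speedup with the ideal one is one simple way to express how little the additional heterogeneous nodes contribute.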

General considerations. Based on the achieved results, we can conclude that CASPER can potentially be used to run both the already hosted MPI-based applications and new Hadoop-based applications. However, some upgrades are needed in order to improve the performance of CASPER on large datasets. The results reported in Sections 3.2 and 3.3 can also be exploited to decide how to allocate the different applications on CASPER. Hadoop seems to achieve better results when homogeneous nodes with large and efficient local disks are used (i.e., the 5 Intel nodes in our current system), while the MPI-based application, which is main memory-intensive, seems to perform better on nodes with more efficient processors and a large amount of main memory (i.e., the AMD nodes in our current system). On CASPER, as on all traditional clusters, a set of queues can be created. Each queue is associated with a set of nodes and can be characterized by a priority level. Based on the discussed results, associating the Hadoop-based applications with a queue that includes the 5 Intel nodes, and the MPI-based applications with a queue that includes the AMD nodes, seems to be, potentially, a good configuration. This configuration should allow Hadoop- and MPI-based applications to run concurrently.

4 Conclusions

Due to the increasing demand for data-intensive applications, we decided to analyze the potential of our low-budget general purpose cluster for this type of application. In this paper, we reported the results of this experience. The performed experiments highlighted the limitations of our current cluster and helped us to identify potential upgrades that should be considered in the future. Further experiments will be performed on other algorithms (e.g., the merge sort algorithm) to confirm the achieved results.

References

1. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1).
2. F. Della Croce, E. Piccolo, and N. Nepote. A terascale, cost-effective open solution for academic computing: early experience of the DAUIN HPC initiative. In AICA 2011, pages 1-9.
3. J. Dongarra. Trends in high performance computing: a historical overview and examination of future developments. IEEE Circuits and Devices Magazine, 22(1):22-27.
4. P. Maier. qsort.c. pm175/f21dp2/src/
5. N. Nepote, E. Piccolo, C. Demartini, and P. Montuschi. Why and how using HPC in university teaching? A case study at PoliTo. In DIDAMATICA 2013.
6. T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Politecnico di Torino. Porto Institutional Repository

Politecnico di Torino. Porto Institutional Repository Politecnico di Torino Porto Institutional Repository [Proceeding] NEMICO: Mining network data through cloud-based data mining techniques Original Citation: Baralis E.; Cagliero L.; Cerquitelli T.; Chiusano

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Research on Job Scheduling Algorithm in Hadoop

Research on Job Scheduling Algorithm in Hadoop Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of

More information

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,

More information

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,

More information

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com

More information

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk. Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

Extending Hadoop beyond MapReduce

Extending Hadoop beyond MapReduce Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core

More information

Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications

Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications Liqiang (Eric) Wang, Hong Zhang University of Wyoming Hai Huang IBM T.J. Watson Research Center Background Hadoop: Apache Hadoop

More information

Hadoop on the Gordon Data Intensive Cluster

Hadoop on the Gordon Data Intensive Cluster Hadoop on the Gordon Data Intensive Cluster Amit Majumdar, Scientific Computing Applications Mahidhar Tatineni, HPC User Services San Diego Supercomputer Center University of California San Diego Dec 18,

More information

TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERINGVARIOUS PARAMETERS

TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERINGVARIOUS PARAMETERS TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERINGVARIOUS PARAMETERS Dr. D Rajya Lakshmi 1, Mr. R Praveen Kumar 2, Mr. N K Sumanth 3 1 Professor of CSE, JNTUK-UCEV, Vizianagaram, AP, (India)

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Automating Big Data Benchmarking for Different Architectures with ALOJA

Automating Big Data Benchmarking for Different Architectures with ALOJA www.bsc.es Jan 2016 Automating Big Data Benchmarking for Different Architectures with ALOJA Nicolas Poggi, Postdoc Researcher Agenda 1. Intro on Hadoop performance 1. Current scenario and problematic 2.

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

An Application of Hadoop and Horizontal Scaling to Conjunction Assessment. Mike Prausa The MITRE Corporation Norman Facas The MITRE Corporation

An Application of Hadoop and Horizontal Scaling to Conjunction Assessment. Mike Prausa The MITRE Corporation Norman Facas The MITRE Corporation An Application of Hadoop and Horizontal Scaling to Conjunction Assessment Mike Prausa The MITRE Corporation Norman Facas The MITRE Corporation ABSTRACT This paper examines a horizontal scaling approach

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce 2012 Third International Conference on Networking and Computing Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi

More information
