TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERING VARIOUS PARAMETERS


TUNING THE PERFORMANCE OF HADOOP MAP REDUCE JOBS BY ALTERING VARIOUS PARAMETERS

Dr. D Rajya Lakshmi 1, Mr. R Praveen Kumar 2, Mr. N K Sumanth 3
1 Professor of CSE, JNTUK-UCEV, Vizianagaram, AP, (India)
2,3 Asst. Prof. (C), CSE, JNTUK-UCEV, Vizianagaram, AP, (India)

ABSTRACT

There is a lot of hype around Hadoop nowadays, which mistakenly suggests that Hadoop can do anything; it cannot. Hadoop is meant for certain kinds of applications and is not suitable for real-time applications. Still, enterprises adopt Hadoop for its extreme horsepower in processing huge data in little time. Hadoop is an open-source framework of tools that supports running applications on Big Data. The simple Hadoop architecture consists of two main components: the Hadoop Distributed File System (HDFS) and Map-Reduce. HDFS breaks a file into fixed-size blocks and spreads them over the cluster. Map-Reduce takes care of computation by moving the code, or pieces of code, to the blocks where they reside on a cluster of commodity hardware. Hadoop overcomes the limitations of a plain distributed file system by storing large data, offering failure protection, and providing fast access, but it is built for a specific class of applications. HDFS blocks are large: the default block size is 64 MB, compared to the 4 KB-8 KB blocks of conventional structured file systems. The major contribution of this work is an evaluation of Hadoop read/write performance with variable block sizes, where the default configuration yields low performance, together with a full treatment of the Map-Reduce concept and the distribution of jobs. Hadoop clusters normally run with a default block size, which is fixed (multiples of 64 MB), yet the read/write speed for such huge files varies from one node to another depending on the commodity hardware. We consider only one parameter here: disk speed. Traditional hard disks are far slower than the solid-state disks now available, so a node with a high-performance disk reads its share of the file and computes fast, while the others compute slowly. The Job Tracker must wait for all computations to finish before returning the final result to the application, and this causes a performance problem. By altering the block sizes, giving high-performance nodes a large block size and low-performance nodes a small block size, all nodes achieve roughly equal read/write performance, so the master node does not wait on the other nodes when collecting the computations, which leads to better performance. Other contributions include changing the Hadoop configuration files, mapper-reducer coordination (reducers need not wait until all mappers finish their tasks), the maximum number of tasks per task tracker, and networking in the rack architecture or core-switch configuration of a cluster; many more things can be done to improve the performance of the Hadoop framework. Note: disk speed is given only as one example parameter.

Keywords: Benchmarks, Combiner, HDFS, HDFS Block Size, Map Reduce.

I. INTRODUCTION

This is the era of Big Data [1], and storing and processing such huge amounts of data is the challenge. Big Data is simply defined as data beyond our storage capacity and processing power. Where does such data come from? Examples include Facebook data (500+ TB per day), flight sensor data (10 TB per half-hour journey), the NYSE (New York Stock Exchange), and files (structured, unstructured, and semi-structured). Big Data generally refers to unstructured or semi-structured data, where the data takes the form of plain logs. Big Data is commonly characterized along three dimensions (the 3 V's):

Volume is the most obvious of the three, referring to the size of the data. The massive volumes of data that we are currently dealing with have required scientists to rethink storage and processing paradigms in order to develop the tools needed to properly analyze them.

Velocity addresses the speed at which data can be received as well as analyzed. This covers the difference between batch processing, which works on historical data, and stream processing, which analyzes the data in real time as it is generated. It also refers to the rate of change of data, which is especially relevant to stream processing.

Variety refers to the issue of disparate and incompatible data formats. Data can come from many different sources and take many different forms, and just preparing it for analysis takes a significant amount of time and effort.

How do we store and process such huge data? The best solution is distributed storage with parallel processing, i.e., Hadoop. Apache Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity computers using a simple programming model. Hadoop is generally meant for Big Data. Some people think that Hadoop can do anything, but an open-source distributed processing framework is not always the right choice; one needs to evaluate carefully when to use Hadoop. Hadoop suits a huge amount of data in a single file rather than small amounts of data spread over multiple files, it is not suitable for real-time applications, and it is built for a specific class of applications: write once, read many times.

1.1 Hadoop's Approach

Hadoop [2] is a framework of tools used to support running applications on Big Data. The traditional approach distributes the data for storage, but the computation remains processor-bound: one powerful computer acts as the server (see Fig 1.1). Big data processed by this powerful enterprise computer results in various problems such as network constraints, latency, and cost. Hadoop's approach is similar, but the beauty here is that the computation is also broken up and the results are combined at the end (see Fig 1.2). No high-end computers are required; commodity hardware is enough to process big data. Data locality is preserved, and the computation is moved towards the data stored in the distributed environment.

Fig 1.1: Traditional Approach (Big Data processed by one powerful computer)

Fig 1.2: Hadoop's Approach (Big Data broken into pieces, computation on commodity hardware, results combined)

1.2 Hadoop Ecosystem

Hadoop has two major components, HDFS [3] and Map Reduce [4]. HDFS (Hadoop Distributed File System) is used to store big data in a distributed environment; Map Reduce is a programming model used to process the distributed data in parallel on a cluster. Hadoop follows a master-slave architecture over a cluster of commodity hardware (see Fig 1.4). We now define some Hadoop-related terminology:

1. HDFS Cluster: the name given to the whole configuration of master and slaves where the data is stored. We discuss the master and slave nodes in detail later.
2. Homogeneous Cluster: all the nodes have similar hardware configurations.
3. Heterogeneous Cluster: the nodes have different hardware configurations, as in a cloud.
4. Rack: all the nodes connected to a single LAN.
5. File System: on top of the underlying (UNIX) file system, Hadoop provides a distributed file system and an API, Map-Reduce, for processing all the stored data.

II. HDFS FEATURES

Highly fault tolerant: data blocks are replicated across data nodes. Since commodity hardware is used, the probability of failure is high, so replication is required for high availability.

Extremely low cost per byte: HDFS uses commodity direct-attached storage and shares the cost of the network and computers it runs on with the MapReduce layer of the Hadoop stack. HDFS is open-source software, so an organization can use it with zero licensing and support costs. This cost advantage lets organizations store and process orders of magnitude more data per dollar than traditional SAN or NAS systems.

Very high bandwidth: HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per computer into the MapReduce layer, on a very low cost shared network.

Fig 1.3: HDFS Cluster Setup (homogeneous/heterogeneous cluster, disk storage across Rack-1 through Rack-n)

2.1 Hadoop Architecture

Let us discuss the components of Hadoop in detail, together with how Hadoop stores data and processes map reduce jobs. The following are Hadoop's first-generation components.

Master
Name Node: manages and maintains the metadata.
Secondary Name Node: the backup node.
Job Tracker: breaks the bigger task into smaller pieces of computation.

Slaves
Data Node: manages the piece of data stored on its machine.
Task Tracker: processes the data on its data node.

These master and slave services are called daemons and run in the background. Every master service can talk to every other master service, and every slave service can talk to every other slave service. Similarly, the name node can talk to the data nodes and vice versa, and the Job Tracker can talk to the Task Trackers and vice versa. All of this communication happens through heartbeats.

Name Node: the master of the system. It maintains and manages the blocks present on the data nodes. The Name Node should not be commodity hardware.
Data Node: the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.
Client: an application used to interact with both the name node and the data nodes.
Job Tracker: sends the computations to the data nodes.
Task Tracker: processes the data using the map reduce programming model.
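As a minimal sketch of how a client interacts with the cluster, the following Java program (the path is hypothetical, not from the paper) reads a file from HDFS: the FileSystem object asks the Name Node for the block locations, then streams the data directly from the Data Nodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client: metadata comes from the Name Node,
// the actual bytes come from the Data Nodes.
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/input.txt"); // hypothetical path
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}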

Fig 1.4: Master-Slave Architecture (R = replication, DN = Data Node). The client sends requests to the Name Node and submits jobs to the Job Tracker; Data Nodes and Task Trackers run on the slaves in Rack-1 through Rack-4.

The client communicates with the Data Nodes and the Name Node through SSH (Secure Shell). The default number of replicas is 3; the user can change it through the dfs.replication configuration property. The Name Node stores the metadata in RAM; the actual data is stored on the Data Nodes.

2.2 HDFS Work Flow

Step 1: the user copies the files (log files, etc.) into the DFS.
Step 2: the user submits a job (some kind of analysis that needs to be done).

Fig 1.5: Job Tracker Work Flow (the client mediates between the user, the DFS with its input splits, maps, and reduces, and the Job Tracker with its job queue)

Step 3: the client gets information from the DFS about the file: where it is located, on which rack, in which data blocks, and so on (supplied by the Name Node).
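To make the dfs.replication setting concrete, here is a hedged Java sketch (the path and the values are illustrative, not from the paper) that writes a file to HDFS with an explicit replication factor and block size, using the FileSystem.create overload that takes both:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default; normally set in hdfs-site.xml.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Per-file override of buffer size, replication factor, block size.
        FSDataOutputStream out = fs.create(
                new Path("/user/demo/sample.txt"), // hypothetical path
                true,               // overwrite if it exists
                4096,               // I/O buffer size in bytes
                (short) 2,          // replication factor for this file
                64 * 1024 * 1024);  // 64 MB block size
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}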

Step 4: the bigger job is split into smaller tasks, one per input split; records are read from each split through an interface called RecordReader.
Step 5: the job information (job.xml, job.jar) is uploaded to the DFS.
Step 6: the job is placed in the job queue of the Job Tracker.
Steps 7 and 8: the job is initialized and the job files are read.
Step 9: the maps and reducers are created.

Map and Reduce programs can be written in a programming language (Java) or a scripting language (Python, Ruby). Map processes the data first and generates key-value pairs; the reducer then works on these and produces the output. Shuffle, sort, combiner, and partitioner are the intermediate phases that run before the result reaches the reducer. We discuss them in detail in the next section.

2.3 Map Reduce Work Flow

The file is divided into blocks of 64 MB by default. If the input file is 1 GB, HDFS provides 16 blocks of 64 MB each. These 16 blocks are the input splits, and the number of Map functions equals the number of input splits. Each Map program emits key-value pairs, which can be passed to a combiner. A combiner is a mini-reducer that performs shuffle and sort on the output of a single mapper.

Shuffle: brings all the related key-value pairs together.
Sort: puts the key-value pairs into sorted order.
Reducer: after shuffle and sort, the next phase is the reducer. There can be one or more reducers, and each reducer produces one output file, named part-00000, part-00001, and so on.

Fig 1.6: Map Reduce Work Flow
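The classic word-count program illustrates this flow. The sketch below follows the well-known example on the standard Hadoop Java API (Hadoop 2.x style); note that the same reduce logic is wired in as a combiner, so each mapper's output is pre-aggregated locally before the shuffle:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (offset, line) -> list of (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, count); also reused as the combiner
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // mini-reducer per mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}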

Partitioner: decides to which reducer each key-value pair coming out of the shuffle and sort phase is sent, so that pairs reach the intended reducer. Used well, this enhances the performance of map reduce jobs.

2.4 Logical View of the Map Reduce Flow

The Map and Reduce functions of MapReduce are both defined with respect to data structured as (key, value) pairs. Map takes one pair of data with a type in one data domain and returns a list of pairs in a different domain:

Map(k1, v1) → list(k2, v2)

The Map function is applied in parallel to every pair in the input dataset, producing a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) → list(v3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.

III. PERFORMANCE ISSUES

There are several areas where research can improve the performance of Hadoop jobs by tuning various parameters such as block sizes, sorting algorithms, and data distribution algorithms. Some of them are identified here, with solutions aimed at benchmarks such as TeraSort and read/write sort:

1. Sort and shuffle optimization techniques to increase the speed-up factor of the map reduce engine.
2. An efficient combiner that reduces the network overhead.
3. An efficient way for the name node to manage and maintain the metadata.
4. Tuning the block sizes so that all nodes have equal read/write performance.
5. Better use of the partitioner, which enhances performance by sending large sets of key-value pairs to the reducer whose configuration is best in a heterogeneous network.
6. Self-learning capability for the nodes, so that a node failure is detected immediately and the job is rescheduled to the nearest node in the cluster, reducing the time reducers spend waiting.
7. A framework to discover the network topology of a cluster and troubleshoot it, e.g., to find which data node is working slowly.
8. Altering the parameters of the Hadoop configuration files.
9. The HDFS block placement policy [5].
10. A framework to reach the TeraSort [6] and read/write sort benchmarks.

IV. TUNING THE PARAMETERS TO INCREASE THE PERFORMANCE

1. Sort and shuffle optimization techniques are required to speed up the map reduce engine. Sorting algorithms should be designed to run in memory. External sorting algorithms waste network bandwidth by moving key-value pairs between RAM and the hard disk, or towards HDFS on other nodes or other racks in the cluster; the goal is to compute the sort in memory only.
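As a hedged illustration of points 1 and 5 above, here is a sketch of a custom partitioner; the class name and the routing rule are invented for illustration, and the point is only that related keys can be steered to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys sharing a first character are routed to
// the same reducer, so related key-value pairs land in one output file.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0; // empty keys all go to the first reducer
        }
        // Text.charAt returns the Unicode code point at the given offset
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be enabled on a job with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(n); when no partitioner is set, Hadoop falls back to a hash partitioner.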

2. Manage and maintain the metadata information on the name node: small files are a big problem in Hadoop. A small file is one significantly smaller than the HDFS block size (default 64 MB). If you are storing small files, you probably have lots of them (otherwise you would not turn to Hadoop), and the problem is that HDFS cannot handle lots of files. Furthermore, HDFS is not geared towards efficient access to small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from data node to data node to retrieve each small file, which is an inefficient data access pattern. The metadata for all these files is very difficult to maintain. Since the metadata is stored in RAM, the name node, even though it is not commodity-class hardware, still struggles to store and maintain it. This is where we need to keep our eyes. Compare a 1 GB file broken into 16 blocks of 64 MB with 10,000 or so 100 KB files: the 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent job with a single input file. There are a couple of features that help reduce the bookkeeping overhead. Hadoop Archives (HAR files) were introduced to HDFS to alleviate the problem of lots of files putting pressure on the name node's memory. Study the different solutions, try to implement them, and work out your own solution.

3. Tuning the block sizes: HDFS blocks are large; the default block size is 64 MB, compared to the 4 KB-8 KB blocks of conventional structured file systems. Hadoop clusters can run at any stage with a default block size, which is fixed (multiples of 64 MB), yet the read/write speed for such huge files varies from one node to another depending on the commodity hardware. Again taking disk speed as the one parameter: traditional hard disks are far slower than solid-state disks, so a node with a high-performance disk reads its share of the file and computes fast while the others compute slowly. The Job Tracker must wait for all computations to finish before returning the final result to the application, and this causes a performance problem. By altering the block sizes, giving a high-performance node a large block size and a low-performance node a small block size, all nodes achieve roughly equal read/write performance, so the master node does not wait on the other nodes when collecting the computations, which leads to better performance.

4. Data placement policies: when a client requests blocks, the name node always hands out the free HDFS blocks nearest to it, so most requests are received by the nearest node every time. This reduces that node's performance, because it is always serving requests rather than doing computation. Data placement algorithms should be designed so that requests are sent to nodes that are not only nearby but also satisfy other metrics, for example that all the nodes in the cluster hold an equal number of HDFS blocks.
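A hedged sketch of the kind of parameter changes discussed in this section is given below. The property names follow Hadoop 1.x, where the Job Tracker and Task Tracker exist, and may differ in later versions; the values are illustrative, not recommendations from the paper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Larger HDFS block size for files written by this job (128 MB here).
        conf.setLong("dfs.block.size", 128 * 1024 * 1024L);

        // Bigger in-memory buffer for the map-side sort, in MB, so less
        // intermediate data spills to disk.
        conf.setInt("io.sort.mb", 256);

        // Let reducers start fetching map output once 80% of the maps are
        // done, instead of waiting for all mappers to finish.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);

        // Maximum concurrent map tasks per task tracker.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);

        Job job = Job.getInstance(conf, "tuned job");
        // ... set mapper/reducer classes and input/output paths as usual ...
    }
}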

V. CONCLUSION

As we live in a world of digital data, one should know how to store and process such huge amounts of data. We defined the Hadoop framework for dealing with Big Data in the sections under the Introduction. Several research issues were identified, such as data placement policies, HDFS block size tuning, efficient sort and shuffle mechanisms, and efficient metadata management, and ideas for solving these issues were briefly discussed in Sections III and IV. Our future work is to run experiments on each of the issues individually, in separate papers, and to present the results using performance measuring tools [7].

REFERENCES

Journal Papers:
[1] Wikipedia, "Big Data," https://en.wikipedia.org/wiki/Big_data.
[3] HDFS User Guide.
[4] Map/Reduce Tutorial.
[5] Exploring HDFS Block Placement Policy.
[6] TeraSort.

Books:
[2] T. White, Hadoop: The Definitive Guide, O'Reilly Media / Yahoo Press.

Theses:
[7] P. Garg, "Performance Enhancement of Big Data Processing in Hadoop Map/Reduce," thesis, IIT Bombay.
