THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL

Size: px
Start display at page:

Download "THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL"

Transcription

1 THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL By VANESSA CEDENO A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Computer Science Degree Awarded: Spring Semester, 2010

2 The members of the committee approve the dissertation of Vanessa Cedeno defended on April 15, Feifei Li Professor Directing Dissertation Zhenghao Zhang Committee Member Piyush Kumar Committee Member The Graduate School has verified and approved the above-named committee members. ii

3 I dedicated this to my parents. iii

4 ACKNOWLEDGEMENTS I would like to acknowledge and thank my major professor Dr. Feifei Li for all his advice and guidance. I would also like to thank Dr. Zhenghao Zhang, along with Dr. Piyush Kumar for their willingness to serve on my committee. Also, I would like to thank my friend Daniel Gomez for his advice and feedback throughout all the various stages of the project. Lastly, I would like to thank the entire faculty and staff of the Computer Science Department, Florida State University. iv

5 TABLE OF CONTENTS List of Tables... vi List of Figures... vii Abstract... viii 1. INTRODUCTION PROBLEM DEFINITION Parallel DBMS MapReduce HADOOP HADOOPDB Installing HadoopDB RESULTS Data Loading Grep Task Selection Task Aggregation Task Join Task CONCLUSION REFERENCES BIOGRAPHICAL SKETCH v

6 LIST OF TABLES 3.1 Differences between parallel databases and Hadoop Summary of results vi

7 LIST OF FIGURES 5.1 Grep, Selection and Aggregation Tasks Selection Task vii

8 ABSTRACT This project presents and evaluates the performance of HadoopDB compared to the DBMS PostgreSQL. Currently, the amount of data is expanding and the analysis of this information can be done efficiently with hundreds of machines working in parallel. For analytical data management of a considerable amount of data, HadoopDB tries to combine the efficiency and performance of parallel databases with the scalability, fault tolerance and flexibility of MapReduce. Parallel databases consist of analytical DBMS systems that deploy on a shared-nothing architecture. The data analysis includes scan operations, multidimensional aggregation and star schema joins. MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. Because the data analyzed is growing considerably MapReduce based systems are proposed as an alternative to manage thousands of nodes in a shared-nothing architecture, but it lacks the characteristics to analyze structure data. HadoopDB is a project from Yale University proposed as a hybrid system using MapReduce as the communication layer between multiple nodes, each of them running DBMS instances. In this paper, we compare the improvement in the performance of different types of queries using HadoopDB with PostgreSQL as the database layer, with using just PostgreSQL as the tool for the same data analysis. viii

9 CHAPTER 1 INTRODUCTION Each year approximately 5 Exabyte, 1018 bytes, of new information is produced [8]. One reason for this is that technology drives storage demand. Enterprises are always in the search of a system that can manage, process and granularly analyze their data. But parallel databases do not scale well into the hundreds of nodes. The reasons are failures while increasing the nodes, also there is the assumption from databases of a homogeneous array of machines. Also parallel databases have not been tested on thousands of nodes, but with the increasing in data this will be necessary in the future. To solve these problems a new hybrid system called HadoopDB was proposed. It takes advantage of the MapReduce programming model developed to scale thousands of nodes in a shared-nothing architecture, while having the performance and efficiency advantages of parallel databases [1]. Queries expressed in SQL are translated into MapReduce for the distribution between the nodes and then manage by the higher performance single node databases. HadoopDB is build out of open source components. Hadoop, the open source version of MapReduce, is used to manage the communication between the increasing nodes and the database layer is managed by PostgreSQL. PostgreSQL is a powerful, open source object-relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness [7]. This hybrid system is evaluated to determine how much there is an improvement in performance comparing it to a DBMS, in this case using just PostgreSQL. In chapter 2 the problem is defined, describing the parallel DBMS and MapReduce technologies. In chapter 3, the Hadoop project is explained. In chapter 4, the HadoopDB system and the installation process is presented. In chapter 5 the DBMS PostgreSQL is described. In chapter 6 the results of the experiments are presented. In chapter 7 the conclusion of the project is given. 1

10 CHAPTER 2 PROBLEM DEFINITION HadoopDB is a hybrid system that takes advantage of two existing technologies, DBMS and MapReduce. It targets analytical workloads and is designed to run on a shared-nothing cluster of commodity machines. It tries to offers a free and open source parallel DBMS. It is scalable as Hadoop, while achieving superior performance on structured data analysis workloads. 2.1 Parallel DBMS A parallel database system tries to improve performance through parallelization of various operations, such as loading data, building indexes and evaluating queries. Although data may be stored in a distributed fashion, the distribution is managed just by performance considerations. Parallel databases improve processing and input and output speeds by using multiple CPUs and disks in parallel [5]. Analyzing large scale data workloads is the advantage of using parallel databases. There exist automated tools that can design, deploy, tune and maintain these databases. But parallel databases do not work well with fault tolerance or working with a heterogeneous environment. The main reason for this is that for parallel databases a failure is a rare event and a typical large cluster is a dozen of nodes and not hundreds of thousands. 2.2 MapReduce MapReduce is a programming model that specifies a map function that processes a key-value pair to generate a set of intermediate key-value pairs. First a set of map tasks are processed in parallel individually by each node. Then data is distributed across all the nodes in the cluster. Then in parallel each nodes takes the partition he had received and performs the Reduce task. A reduce function merges all intermediate values associated with the same intermediate key [9]. 2

11 MapReduce has the advantage to checkpoint the output of each Map task to local disk to recover while facing a failure. This is important in terms of fault tolerance and operating in heterogeneous environment properties because of the redundantly executed task. If a failure is detected Map tasks are reassigned to other live nodes [1]. Because MapReduce requires that the user develop a certain structure for the existing data it is ideal to combine it advantages with the performance of parallel databases. 3

12 CHAPTER 3 HADOOP The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop is a MapReduce implementation for processing large data sets over 1000s of nodes [6]. Hadoop includes various subprojects like Hadoop Common, Avro, Chukwa, HBase, HDFS, Hive, MapReduce, Pig and ZooKeeper. HadoopDB uses Hadoop Common because these are the utilities that support the other Hadoop subprojects [2]. Also, MapReduce based systems do not accept SQL commands, but Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying [4]. Hadoop implements MapReduce, and in addition, it provides a distributed file system, HDFS, that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework. Table 3.1: Differences between parallel databases and hadoop [1]. Parallel Database MapReduce Data Designed for structured, relational data Designed for unstructured, data Query Interface SQL MapReduce programs written in a variety of languages, including SQL Query Execution Pipelines results between operator Materializes results between Map and Reduce phases Job Granularity Entire query Determined by data storage block size The table shows that parallel databases provide a high performance, scalability is not a property. While MapReduce is scalable but high performance is not one of his strengths. It will be ideal to combine the scalability and high performance in one tool. 4

13 CHAPTER 4 HADOOPDB HadoopDB is a hybrid of DBMS and MapReduce technologies that targets analytical workloads [3]. HadoopDB gives Hadoop access to multiple single-node DBMS servers deployed across the cluster, in this project is PostgreSQL. HadoopDB pushes as much as possible data processing into the database engine by issuing SQL queries to create a system that resembles a shared-nothing parallel database. Applying techniques taken from the database world leads to a performance boost and because HadoopDB relies on MapReduce framework it guaranties scalability, fault and heterogeneity tolerance similar to Hadoop. The Hadoop framework on HadoopDB consist of two layers, the Hadoop Distributed File System or HDFS and the MapReduce Framework. HDFS is a blockstructured file system managed by a central NameNode. Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of blocks and their replicas. The MapReduce Framework follows a master-slave architecture. The master is a single JobTracker and the slaves or worker nodes are TaskTrackers. The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker s load and available resources. Each job is broken down into Map tasks based on the number of data blocks that require processing, and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing. 4.1 Installing HadoopDB There are not many tutorials that help with the installation of HadoopDB, the official tutorial is very helpful but doesn t provide a detailed process. The following tutorial will help future students to install HadoopDB in a faster way. First the required software includes JavaTM 1.6.x, Hadoop and the single node databases installed on each slave node in the cluster. In our project the database is PostgreSQL. 5

14 After unpacking the downloaded Hadoop distribution. In the distribution, edit the file conf/hadoop-env.sh to define JAVA_HOME to be the root of your Java installation. export JAVA_HOME=/usr/ To start the Hadoop cluster in fully distributed mode we have to configure the file conf/hadoop-site.xml. The fs.default.name value specifies the name of the machine that will become the master node. The mapred.job.tracker value specifies the name of the machine that will become the job tracker, it could be the same master node but in our case it will be a different node. The dfs.replication value indicates how many time the data is going to be replicated through the nodes, in our project is 1. Now check that you can ssh to the localhost without a passphrase. Also, the file conf/slaves need to include all the slave nodes that are going to be part of the cluster. To start a Hadoop cluster you will need to start both the HDFS and MapReduce: bin/hadoop namenode format To start the HDFS the following command needs to run on the designated NameNode, the script also consults the conf/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves: $ bin/start-dfs.sh To stop the HDFS the following command needs to run on the designated NameNode, the script also consults the conf/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves: $ bin/stop-dfs.sh To start MapReduce the following command needs to run on the designated JobTracker. The script also consults the conf/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves: 6

15 bin/start-mapred.sh To stop MapReduce the following command needs to run on the designated JobTracker. The script also consults the conf/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves: bin/stop-mapred.sh After installing Hadoop, HadoopDB's binary version, hadoopdb.jar, needs to placed under HADOOP_HOME/lib on each node. The JDBC jar driver is also required and needs to be placed in the lib directory. For Hadoop to recognize these changes, a restart may be required. You should stop and then start Hadoop. After installing PostgreSQL the pg_hba.conf file should be edited as follows: local all all trust host all all /0 password host all all ::1/128 trust host all all xxx.xxx.xxx.x password Then edit the file postgresql.conf to be as follows: listen_addresses = '*' port = 5432 shared_buffers = 512MB work_mem = 1024MB # what IP address(es) to listen on; # (change requires restart) Then, restart Postgres using pg_ctl restart, and if PostgreSQL refuses to start with such large memory settings, you may need to run: sysctl -w kernel.shmmax=

16 This setting should also be added to /etc/sysctl.conf. The SMS Planner setup refers to the SQL to MapReduce to SQL, SMS. The planner consists of a slightly modified Hive build and SMS specific classes. Installing Hive is easy, and after the installation the command HIVE_HOME/bin/hive will run it. But before executing any HadoopDB jobs the data needs to be prepare and loaded. First, data need to be loaded into HDFS: bin/hadoop dfs -put file.txt name_of_the_file_on_hdfs.txt If you want to see the data in HDFS to make sure the loading was effective: bin/hadoop fs lsr / Then, a custom-made Hadoop job, GlobalHasher, repartitions data into a specified number of partitions. In our case we were working with 8 nodes: bin/hadoop jar /usr/local/hadoop /lib/hadoopdb.jar edu.yale.cs.hadoopdb.dataloader.globalhasher /file.txt /parts/ 6 \ 0 Each partition should be downloaded into a node's local file system, for example: bin/hadoop fs -get /parts/ part /usr/local/hadoop /p0 Then we created a table in PostgreSQL on each node: CREATE TABLE grep ( userid int, NAME varchar(300), rating int); Each chunked file is then bulk-loaded into a separate database within a node and indexed appropriately: 8

17 COPY grep FROM '/usr/local/hadoop /p0' WITH DELIMITER E'\t'; The HadoopDB Catalog needs to reflect the location of all chunks. In the current implementation the catalog is an XML file stored in HDFS. The SimpleCatalogGenerator tool assumes uniformly distributed chunks across all nodes, with each node having the exact same number of chunks. It associates each relation's chunks with a partition id and allows the existence of chunked and complete relations per node. The following command to generate the HadoopDB.xml catalog: java -cp hadoopdb.jar edu.yale.cs.hadoopdb.catalog.simplecataloggenerator path_to_catalog.properties generation: The Catalog.properties file contains parameters necessary for Catalog nodes_file=machines.txt #specifies the nodes in the cluster relations_unchunked=grep, EntireRankings #specifies the table names relations_chunked=rankings, UserVisits #specifies the table names After generating the catalog, upload it to HDFS: hadoop dfs -put HadoopDB.xml HadoopDB.xml If you modify an existing file of HaddopDB.xml file don t forget to remove it first from the HDFS before uploading it again: bin/hadoop dfs -rmr HadoopDB.xml Before any query execution, we update the MetaStore with references to our database tables. Hive allows tables to exist externally, outside HDFS. 9

18 CREATE EXTERNAL TABLE grep( userid int, NAME string, rating int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.smsinputformat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputfor mat' LOCATION '/db/grep'; the path. Note the location path needs to have the relation name as the final directory in 10

19 CHAPTER 5 RESULTS The following experiments compare performance of HadoopDB and PostgreSQL. Hadoop version running on Java was used. The characteristics of the machines used as nodes were the following: Fedora Release 9 Memory 2.0 GiB Processor: Intel (R) Xeon (R) CPU 1.86 GHz Available disk space:22.2 GiB We worked with random data of 560 MB on each node. 5.1 Data Loading Putting the 3.3 GB of data on the HDFS before performing the partition was 291 seconds. While the global hasher took 482 seconds to pslit the data in 6 chunks. 5.2 Grep Task Each record consists of a unique key in the first 10 bytes, followed by a 90-byte character string. The pattern XYZ is searched for in the 90 byte field. Each node contains 560MB of data. HadoopDB and PostgreSQL executed the identical SQL: SELECT * FROM grep WHERE NAME LIKE %xyz% ; None of the benchmarked systems contained an index on the field attribute. This means that for all systems, this query requires a full table scan and is mostly limited by disk speed. HadoopDB s SMS planner pushes the WHERE clause into the PostgreSQL instances. The performance of each system is presented in Figure

20 5.3 Selection Task The first structured data task evaluates a simple selection predicate on the rating attribute from the grep table. There are approximately 36,000 tuples on each node that pass this predicate. SELECT NAME, rating FROM grep WHERE rating >5; HadoopDB s SMS planner pushes the selection and projection clauses into the PostgreSQL instances. The performance of each system is presented in Figure Aggregation Task Unlike the previous tasks, this task requires intermediate results to be exchanged between different nodes in the cluster so that the final aggregate can be calculated. SELECT SUM( rating), NAME FROM grep group by NAME; The SMS planner for HadoopDB pushes the entire SQL query into the PostgreSQL instances. The output is then sent to Reduce jobs inside of Hadoop that perform the final aggregation after collecting all pre-aggregated sums from each PostgreSQL instance. The performance of each system is presented in Figure Join Task The key difference between this task and the previous tasks is that it must read in two different data sets and join them together. We push the selection, join, and partial aggregation into the PostgreSQL instances with the following SQL: SELECT g.name, AVG(h.userid), SUM(g.rating) FROM grep g JOIN grep h ON (g.name=h.name) 12

21 GROUP BY g.name; The performance of each system is presented in Figure Seconds HadoopDB PostgreSQL 50 0 Grep Selection Aggregation Figure 5.1: Grep, Selection and Aggregation Tasks. Seconds Join Task 0 HadoopDB PostgreSQL Figure 5.2: Join Task. 13

22 Table 5.1: Summary of results. Task HadoopDB PostgreSQL Grep 47.4 s 180 s Selection 46.8 s 182 s Aggregation s 231 s Join 1709 s 6494 s 14

23 CHAPTER 6 CONCLUSION The experiments show that PostgreSQL performance is improved by using the hybrid system HadoopDB. The MapReduce based system allows the distribution of the data among the different nodes in the cluster. In this way the load is distributed and the time minimized. Table 5.1 shows that a cluster of 8 nodes running HadoopDB will perform 3.8 times faster than the same SQL performed on PostgreSQL with the same amount of data. HadoopDB is a hybrid of the parallel DBMS and Hadoop technologies to data analysis. It achieves the performance and efficiency of parallel databases, with the scalability, fault tolerance, and flexibility of MapReduce-based systems. The ability of HadoopDB to incorporate Hadoop and open source DBMS software makes HadoopDB flexible and extensible for performing data analysis at the large scales expected of future workloads. This is proven with the experiments comparing the same load of data performed on a single machine. 15

24 REFERENCES [1] A. Abouzeid, K. Bajda-Pawlikowski, and D. Abadi. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. [2] Hadoop. Web Page. [3] HadoopDB. Web Page. [4] Hive. Web Page. [5] Parallel database. Web Page. [6] Apache Hadoop. Web Page. [7] PostgreSQL. Web Page. [8] C. Mullins. The DBA corner. Web Page July [9] MapReduce. Web Page. 16

25 BIOGRAPHICAL SKETCH Vanessa Cedeno In the spring of 2007, Vanessa Cedeno completed her Bachelor s degree in Computer Engineering at ESPOL, Escuela Superior Politecnica del Litoral in Guayaquil, Ecuador. She enrolled in the master program at Florida State University in the fall of Vanessa s research interests include database systems, distributed systems and analysis of algorithms. 17

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster Setup Hadoop On Ubuntu Linux ---Multi-Node Cluster We have installed the JDK and Hadoop for you. The JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22 The Hadoop home is /home/user/hadoop-0.20.2 1. Network Edit

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid 1, Kamil Bajda-Pawlikowski 1, Daniel Abadi 1, Avi Silberschatz 1, Alexander Rasin 2 1 Yale University,

More information

TP1: Getting Started with Hadoop

TP1: Getting Started with Hadoop TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

How To Install Hadoop 1.2.1.1 From Apa Hadoop 1.3.2 To 1.4.2 (Hadoop)

How To Install Hadoop 1.2.1.1 From Apa Hadoop 1.3.2 To 1.4.2 (Hadoop) Contents Download and install Java JDK... 1 Download the Hadoop tar ball... 1 Update $HOME/.bashrc... 3 Configuration of Hadoop in Pseudo Distributed Mode... 4 Format the newly created cluster to create

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G...

Running Hadoop On Ubuntu Linux (Multi-Node Cluster) - Michael G... Go Home About Contact Blog Code Publications DMOZ100k06 Photography Running Hadoop On Ubuntu Linux (Multi-Node Cluster) From Michael G. Noll Contents 1 What we want to do 2 Tutorial approach and structure

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has

More information

The Performance of MapReduce: An In-depth Study

The Performance of MapReduce: An In-depth Study The Performance of MapReduce: An In-depth Study Dawei Jiang Beng Chin Ooi Lei Shi Sai Wu School of Computing National University of Singapore {jiangdw, ooibc, shilei, wusai}@comp.nus.edu.sg ABSTRACT MapReduce

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Installation and Configuration Documentation

Installation and Configuration Documentation Installation and Configuration Documentation Release 1.0.1 Oshin Prem October 08, 2015 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE

More information

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010 Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.

Hadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ. Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system

PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system DOI 10.1007/s11280-014-0312-2 PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system Jun-Sung Kim Kyu-Young Whang Hyuk-Yoon Kwon Il-Yeol Song Received: 10 June

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Entering the Zettabyte Age Jeffrey Krone

Entering the Zettabyte Age Jeffrey Krone Entering the Zettabyte Age Jeffrey Krone 1 Kilobyte 1,000 bits/byte. 1 megabyte 1,000,000 1 gigabyte 1,000,000,000 1 terabyte 1,000,000,000,000 1 petabyte 1,000,000,000,000,000 1 exabyte 1,000,000,000,000,000,000

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

HDFS Users Guide. Table of contents

HDFS Users Guide. Table of contents Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Tutorial for Assignment 2.0

Tutorial for Assignment 2.0 Tutorial for Assignment 2.0 Florian Klien & Christian Körner IMPORTANT The presented information has been tested on the following operating systems Mac OS X 10.6 Ubuntu Linux The installation on Windows

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Sriram Krishnan, Ph.D. sriram@sdsc.edu

Sriram Krishnan, Ph.D. sriram@sdsc.edu Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce

More information

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation

1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation 1. GridGain In-Memory Accelerator For Hadoop GridGain's In-Memory Accelerator For Hadoop edition is based on the industry's first high-performance dual-mode in-memory file system that is 100% compatible

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Hadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009

Hadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Hadoop and its Usage at Facebook Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook

More information

Hadoop Training Hands On Exercise

Hadoop Training Hands On Exercise Hadoop Training Hands On Exercise 1. Getting started: Step 1: Download and Install the Vmware player - Download the VMware- player- 5.0.1-894247.zip and unzip it on your windows machine - Click the exe

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information