THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL
THE FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES

COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL

By
VANESSA CEDENO

A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Computer Science

Degree Awarded: Spring Semester, 2010
The members of the committee approve the dissertation of Vanessa Cedeno, defended on April 15,

Feifei Li, Professor Directing Dissertation
Zhenghao Zhang, Committee Member
Piyush Kumar, Committee Member

The Graduate School has verified and approved the above-named committee members.
I dedicate this to my parents.
ACKNOWLEDGEMENTS

I would like to acknowledge and thank my major professor, Dr. Feifei Li, for all his advice and guidance. I would also like to thank Dr. Zhenghao Zhang, along with Dr. Piyush Kumar, for their willingness to serve on my committee. Also, I would like to thank my friend Daniel Gomez for his advice and feedback throughout all the various stages of the project. Lastly, I would like to thank the entire faculty and staff of the Computer Science Department, Florida State University.
TABLE OF CONTENTS

List of Tables
List of Figures
Abstract
1. INTRODUCTION
2. PROBLEM DEFINITION
   2.1 Parallel DBMS
   2.2 MapReduce
3. HADOOP
4. HADOOPDB
   4.1 Installing HadoopDB
5. RESULTS
   5.1 Data Loading
   5.2 Grep Task
   5.3 Selection Task
   5.4 Aggregation Task
   5.5 Join Task
6. CONCLUSION
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

3.1 Differences between parallel databases and Hadoop
5.1 Summary of results
LIST OF FIGURES

5.1 Grep, Selection and Aggregation Tasks
5.2 Join Task
ABSTRACT

This project presents and evaluates the performance of HadoopDB compared to the DBMS PostgreSQL. The amount of data being produced is expanding rapidly, and analyzing this information can be done efficiently with hundreds of machines working in parallel. For analytical management of large amounts of data, HadoopDB tries to combine the efficiency and performance of parallel databases with the scalability, fault tolerance, and flexibility of MapReduce. Parallel databases are analytical DBMS systems deployed on a shared-nothing architecture; their data analysis includes scan operations, multidimensional aggregation, and star schema joins. MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. Because the volume of data to be analyzed is growing considerably, MapReduce-based systems have been proposed as an alternative able to manage thousands of nodes in a shared-nothing architecture, but they lack the characteristics needed to analyze structured data. HadoopDB is a project from Yale University proposed as a hybrid system that uses MapReduce as the communication layer between multiple nodes, each of them running a DBMS instance. In this paper, we compare the performance of different types of queries using HadoopDB with PostgreSQL as the database layer against using just PostgreSQL for the same data analysis.
CHAPTER 1
INTRODUCTION

Each year approximately 5 exabytes (10^18 bytes) of new information is produced [8]. One reason for this is that technology drives storage demand. Enterprises are always in search of a system that can manage, process, and granularly analyze their data. But parallel databases do not scale well into the hundreds of nodes: failures become more frequent as nodes are added, and databases assume a homogeneous array of machines. Parallel databases have also not been tested on thousands of nodes, although the growth in data will make this necessary in the future. To solve these problems a new hybrid system called HadoopDB was proposed. It takes advantage of the MapReduce programming model, developed to scale to thousands of nodes in a shared-nothing architecture, while having the performance and efficiency advantages of parallel databases [1]. Queries expressed in SQL are translated into MapReduce jobs for distribution between the nodes and then handled by the higher-performance single-node databases. HadoopDB is built out of open source components: Hadoop, the open source implementation of MapReduce, manages the communication between the nodes, and the database layer is managed by PostgreSQL. PostgreSQL is a powerful, open source object-relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness [7]. This hybrid system is evaluated to determine how much of a performance improvement it offers over a standalone DBMS, in this case PostgreSQL alone.

In chapter 2 the problem is defined, describing the parallel DBMS and MapReduce technologies. In chapter 3, the Hadoop project is explained. In chapter 4, the HadoopDB system and its installation process are presented. In chapter 5 the results of the experiments are presented. In chapter 6 the conclusion of the project is given.
CHAPTER 2
PROBLEM DEFINITION

HadoopDB is a hybrid system that takes advantage of two existing technologies, DBMS and MapReduce. It targets analytical workloads and is designed to run on a shared-nothing cluster of commodity machines. It aims to offer a free and open source parallel DBMS that is as scalable as Hadoop while achieving superior performance on structured data analysis workloads.

2.1 Parallel DBMS

A parallel database system tries to improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries. Although data may be stored in a distributed fashion, the distribution is governed solely by performance considerations. Parallel databases improve processing and input/output speeds by using multiple CPUs and disks in parallel [5]. Analyzing large-scale data workloads is the main advantage of parallel databases, and automated tools exist that can design, deploy, tune, and maintain them. But parallel databases do not handle fault tolerance well, nor do they work well in a heterogeneous environment. The main reason is that parallel databases treat failure as a rare event and assume that a typical large cluster consists of a few dozen nodes, not hundreds or thousands.

2.2 MapReduce

MapReduce is a programming model in which a user-specified map function processes a key-value pair to generate a set of intermediate key-value pairs. First, a set of Map tasks is processed in parallel, individually by each node. The intermediate data is then redistributed across all the nodes in the cluster. Finally, each node in parallel takes the partition it has received and performs a Reduce task: a reduce function merges all intermediate values associated with the same intermediate key [9].
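The map/shuffle/reduce flow described above can be sketched in a few lines. The following is a minimal, single-process illustration of the model; the word-count example and the function names are ours, not part of Hadoop or HadoopDB:

```python
from collections import defaultdict

def map_fn(record):
    """Map: emit an intermediate (key, value) pair per word."""
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: merge all intermediate values sharing the same key."""
    return (key, sum(values))

def map_reduce(records):
    # Map phase: each record is processed independently
    # (on a cluster, this happens in parallel on many nodes).
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # The shuffle has grouped values by key; the Reduce phase
    # merges each group into a final value.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

print(map_reduce(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

On a real cluster the intermediate pairs are partitioned across nodes between the two phases; here the `defaultdict` plays the role of that shuffle step.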
MapReduce checkpoints the output of each Map task to local disk so that it can recover from failures. This checkpointing, together with redundantly executed tasks, is what gives MapReduce its fault tolerance and its ability to operate in heterogeneous environments: if a failure is detected, Map tasks are reassigned to other live nodes [1]. Because MapReduce requires the user to impose structure on the data themselves, it is attractive to combine its advantages with the performance of parallel databases.
CHAPTER 3
HADOOP

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop is a MapReduce implementation for processing large data sets over thousands of nodes [6]. Hadoop includes various subprojects such as Hadoop Common, Avro, Chukwa, HBase, HDFS, Hive, MapReduce, Pig, and ZooKeeper. HadoopDB uses Hadoop Common, the set of utilities that support the other Hadoop subprojects [2]. MapReduce-based systems do not accept SQL commands directly, but Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying [4]. Hadoop implements MapReduce and, in addition, provides a distributed file system, HDFS, that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.

Table 3.1: Differences between parallel databases and Hadoop [1].
- Data: Parallel databases are designed for structured, relational data; MapReduce is designed for unstructured data.
- Query Interface: Parallel databases use SQL; MapReduce programs are written in a variety of languages, including SQL.
- Query Execution: Parallel databases pipeline results between operators; MapReduce materializes results between the Map and Reduce phases.
- Job Granularity: For parallel databases it is the entire query; for MapReduce it is determined by the data storage block size.

The table shows that while parallel databases provide high performance, scalability is not one of their properties, and while MapReduce is scalable, high performance is not one of its strengths. It would be ideal to combine scalability and high performance in one tool.
CHAPTER 4
HADOOPDB

HadoopDB is a hybrid of DBMS and MapReduce technologies that targets analytical workloads [3]. HadoopDB gives Hadoop access to multiple single-node DBMS servers deployed across the cluster; in this project the DBMS is PostgreSQL. HadoopDB pushes as much data processing as possible into the database engine by issuing SQL queries, creating a system that resembles a shared-nothing parallel database. Applying techniques taken from the database world leads to a performance boost, and because HadoopDB relies on the MapReduce framework it guarantees scalability and fault and heterogeneity tolerance similar to Hadoop.

The Hadoop framework in HadoopDB consists of two layers: the Hadoop Distributed File System (HDFS) and the MapReduce framework. HDFS is a block-structured file system managed by a central NameNode. Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of blocks and their replicas. The MapReduce framework follows a master-slave architecture: the master is a single JobTracker and the slaves, or worker nodes, are TaskTrackers. The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker's load and available resources. Each job is broken down into Map tasks, based on the number of data blocks that require processing, and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing.

4.1 Installing HadoopDB

There are not many tutorials that help with the installation of HadoopDB; the official tutorial is helpful but does not provide a detailed process. The following tutorial should help future students install HadoopDB more quickly. The required software includes Java 1.6.x, Hadoop, and a single-node database installed on each slave node in the cluster. In our project the database is PostgreSQL.
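As a sketch of the cluster configuration discussed in the next step, a minimal conf/hadoop-site.xml might look as follows. The hostnames and port numbers are illustrative assumptions, not values required by Hadoop:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- hostname "master" and port 9000 are illustrative -->
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- in our setup, a node different from the master runs the JobTracker -->
    <value>jobtracker:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <!-- each block is stored once, as in this project -->
    <value>1</value>
  </property>
</configuration>
```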
After unpacking the downloaded Hadoop distribution, edit the file conf/hadoop-env.sh to define JAVA_HOME as the root of your Java installation:

export JAVA_HOME=/usr/

To start the Hadoop cluster in fully distributed mode we have to configure the file conf/hadoop-site.xml. The fs.default.name value specifies the name of the machine that will become the master node. The mapred.job.tracker value specifies the name of the machine that will become the job tracker; it could be the same master node, but in our case it is a different node. The dfs.replication value indicates how many times the data is replicated across the nodes; in our project it is 1. Now check that you can ssh to the localhost without a passphrase. Also, the file conf/slaves needs to include all the slave nodes that are going to be part of the cluster.

To start a Hadoop cluster you will need to format the file system and then start both HDFS and MapReduce:

$ bin/hadoop namenode -format

To start HDFS, run the following command on the designated NameNode; the script also consults the conf/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves:

$ bin/start-dfs.sh

To stop HDFS, run the following command on the designated NameNode; the script also consults the conf/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves:

$ bin/stop-dfs.sh

To start MapReduce, run the following command on the designated JobTracker; the script also consults the conf/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves:
$ bin/start-mapred.sh

To stop MapReduce, run the following command on the designated JobTracker; the script also consults the conf/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves:

$ bin/stop-mapred.sh

After installing Hadoop, HadoopDB's binary, hadoopdb.jar, needs to be placed under HADOOP_HOME/lib on each node. The JDBC driver jar is also required and needs to be placed in the same lib directory. For Hadoop to recognize these changes, a restart may be required: stop and then start Hadoop.

After installing PostgreSQL, the pg_hba.conf file should be edited as follows:

local all all trust
host all all /0 password
host all all ::1/128 trust
host all all xxx.xxx.xxx.x password

Then edit the file postgresql.conf as follows:

listen_addresses = '*'    # what IP address(es) to listen on;
                          # (change requires restart)
port = 5432
shared_buffers = 512MB
work_mem = 1024MB

Then restart PostgreSQL using pg_ctl restart. If PostgreSQL refuses to start with such large memory settings, you may need to run:

sysctl -w kernel.shmmax=
This setting should also be added to /etc/sysctl.conf.

The SMS planner (SQL to MapReduce to SQL) consists of a slightly modified Hive build and SMS-specific classes. Installing Hive is easy, and after the installation the command HIVE_HOME/bin/hive will run it.

Before executing any HadoopDB jobs the data needs to be prepared and loaded. First, the data is loaded into HDFS:

bin/hadoop dfs -put file.txt name_of_the_file_on_hdfs.txt

If you want to see the data in HDFS to make sure the loading was successful:

bin/hadoop fs -lsr /

Then, a custom-made Hadoop job, GlobalHasher, repartitions the data into a specified number of partitions. In our case we were working with 8 nodes:

bin/hadoop jar /usr/local/hadoop /lib/hadoopdb.jar edu.yale.cs.hadoopdb.dataloader.GlobalHasher /file.txt /parts/ 6 \ 0

Each partition should be downloaded into a node's local file system, for example:

bin/hadoop fs -get /parts/ part /usr/local/hadoop /p0

Then we created a table in PostgreSQL on each node:

CREATE TABLE grep (userid int, NAME varchar(300), rating int);

Each chunked file is then bulk-loaded into a separate database within a node and indexed appropriately:
COPY grep FROM '/usr/local/hadoop /p0' WITH DELIMITER E'\t';

The HadoopDB catalog needs to reflect the location of all chunks. In the current implementation the catalog is an XML file stored in HDFS. The SimpleCatalogGenerator tool assumes chunks are uniformly distributed across all nodes, with each node having the exact same number of chunks. It associates each relation's chunks with a partition id and allows both chunked and complete relations per node. The following command generates the HadoopDB.xml catalog:

java -cp hadoopdb.jar edu.yale.cs.hadoopdb.catalog.SimpleCatalogGenerator path_to_catalog.properties

The Catalog.properties file contains the parameters necessary for catalog generation:

nodes_file=machines.txt                      # specifies the nodes in the cluster
relations_unchunked=grep, EntireRankings     # specifies the table names
relations_chunked=Rankings, UserVisits       # specifies the table names

After generating the catalog, upload it to HDFS:

hadoop dfs -put HadoopDB.xml HadoopDB.xml

If you modify an existing HadoopDB.xml file, don't forget to remove it from HDFS before uploading it again:

bin/hadoop dfs -rmr HadoopDB.xml

Before any query execution, we update the Hive MetaStore with references to our database tables. Hive allows tables to exist externally, outside HDFS.
CREATE EXTERNAL TABLE grep (userid int, NAME string, rating int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'edu.yale.cs.hadoopdb.sms.connector.SMSInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/db/grep';

Note that the location path needs to have the relation name as the final directory in the path.
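The repartitioning step performed by GlobalHasher above can be illustrated with a small sketch: records are assigned to partitions by hashing a key, so rows with equal keys always land on the same node. This single-process Python version is only an illustration of the idea, not HadoopDB code; the function name and the use of crc32 are our choices:

```python
import zlib

def hash_partition(records, key_fn, num_partitions):
    """Assign each record to a partition by hashing its key.

    Rows with equal keys always land in the same partition, so joins
    and group-bys on that key can later run locally on each node.
    """
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        # crc32 is used here only as a stable, portable hash function.
        bucket = zlib.crc32(key_fn(rec).encode()) % num_partitions
        parts[bucket].append(rec)
    return parts

rows = [("alice", 7), ("bob", 3), ("alice", 2)]
parts = hash_partition(rows, key_fn=lambda r: r[0], num_partitions=2)
# Both "alice" rows are guaranteed to sit in the same partition.
```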
CHAPTER 5
RESULTS

The following experiments compare the performance of HadoopDB and PostgreSQL. Hadoop version running on Java was used. The machines used as nodes had the following characteristics:

Fedora Release 9
Memory: 2.0 GiB
Processor: Intel(R) Xeon(R) CPU, 1.86 GHz
Available disk space: 22.2 GiB

We worked with random data of 560 MB on each node.

5.1 Data Loading

Putting the 3.3 GB of data on HDFS before partitioning took 291 seconds, while the GlobalHasher took 482 seconds to split the data into 6 chunks.

5.2 Grep Task

Each record consists of a unique key in the first 10 bytes, followed by a 90-byte character string. The pattern XYZ is searched for in the 90-byte field. Each node contains 560 MB of data. HadoopDB and PostgreSQL executed the identical SQL:

SELECT * FROM grep WHERE NAME LIKE '%xyz%';

Neither of the benchmarked systems contained an index on the field attribute. This means that for both systems this query requires a full table scan and is mostly limited by disk speed. HadoopDB's SMS planner pushes the WHERE clause into the PostgreSQL instances. The performance of each system is presented in Figure 5.1.
5.3 Selection Task

The first structured data task evaluates a simple selection predicate on the rating attribute of the grep table. Approximately 36,000 tuples on each node pass this predicate.

SELECT NAME, rating FROM grep WHERE rating > 5;

HadoopDB's SMS planner pushes the selection and projection clauses into the PostgreSQL instances. The performance of each system is presented in Figure 5.1.

5.4 Aggregation Task

Unlike the previous tasks, this task requires intermediate results to be exchanged between different nodes in the cluster so that the final aggregate can be calculated.

SELECT SUM(rating), NAME FROM grep GROUP BY NAME;

The SMS planner for HadoopDB pushes the entire SQL query into the PostgreSQL instances. The output is then sent to Reduce jobs inside Hadoop that perform the final aggregation after collecting the pre-aggregated sums from each PostgreSQL instance. The performance of each system is presented in Figure 5.1.

5.5 Join Task

The key difference between this task and the previous tasks is that it must read two different data sets and join them together. We push the selection, join, and partial aggregation into the PostgreSQL instances with the following SQL:

SELECT g.NAME, AVG(h.userid), SUM(g.rating) FROM grep g JOIN grep h ON (g.NAME = h.NAME)
GROUP BY g.NAME;

The performance of each system is presented in Figure 5.2.

Figure 5.1: Grep, Selection and Aggregation Tasks. (Bar chart of execution time in seconds for HadoopDB and PostgreSQL on each task.)

Figure 5.2: Join Task. (Bar chart of execution time in seconds for HadoopDB and PostgreSQL.)
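The two-phase evaluation of the aggregation task in Section 5.4 can be sketched as follows: each node's PostgreSQL instance computes a partial SUM(rating) per NAME, and a Reduce step merges the per-node partials into the final result. This is a Python illustration of the data flow, not HadoopDB code:

```python
from collections import defaultdict

def local_aggregate(rows):
    """Runs on each node: SELECT SUM(rating), NAME ... GROUP BY NAME."""
    sums = defaultdict(int)
    for name, rating in rows:
        sums[name] += rating
    return dict(sums)

def merge_partials(partials):
    """Runs in the Reduce phase: combine pre-aggregated sums from all nodes."""
    total = defaultdict(int)
    for partial in partials:
        for name, s in partial.items():
            total[name] += s
    return dict(total)

node1 = local_aggregate([("a", 3), ("b", 1)])
node2 = local_aggregate([("a", 2)])
print(merge_partials([node1, node2]))  # {'a': 5, 'b': 1}
```

Because SUM is decomposable (a sum of sums equals the overall sum), only the small per-node partial results cross the network, not the raw rows.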
Table 5.1: Summary of results.

Task          HadoopDB    PostgreSQL
Grep          47.4 s      180 s
Selection     46.8 s      182 s
Aggregation   s           231 s
Join          1709 s      6494 s
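The speedup figure quoted in the conclusion follows directly from Table 5.1 by dividing each PostgreSQL time by the corresponding HadoopDB time; the aggregation task is left out here because its HadoopDB time is not legible in the table:

```python
# Execution times from Table 5.1, in seconds: (HadoopDB, PostgreSQL).
times = {
    "Grep":      (47.4, 180.0),
    "Selection": (46.8, 182.0),
    "Join":      (1709.0, 6494.0),
}
for task, (hdb, pg) in times.items():
    # PostgreSQL time divided by HadoopDB time gives the speedup factor.
    print(f"{task}: {pg / hdb:.1f}x faster with HadoopDB")
```

For the grep and join tasks this works out to roughly 3.8x, consistent with the claim in the conclusion.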
CHAPTER 6
CONCLUSION

The experiments show that PostgreSQL performance is improved by using the hybrid system HadoopDB. The MapReduce-based system distributes the data among the different nodes in the cluster; in this way the load is spread out and the execution time minimized. Table 5.1 shows that a cluster of 8 nodes running HadoopDB performs about 3.8 times faster than the same SQL executed on PostgreSQL over the same amount of data. HadoopDB is a hybrid of parallel DBMS and Hadoop approaches to data analysis. It achieves the performance and efficiency of parallel databases together with the scalability, fault tolerance, and flexibility of MapReduce-based systems. The ability of HadoopDB to incorporate Hadoop and open source DBMS software makes it flexible and extensible for performing data analysis at the large scales expected of future workloads. This is demonstrated by the experiments comparing the same data load processed on a single machine.
REFERENCES

[1] A. Abouzeid, K. Bajda-Pawlikowski, and D. Abadi. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
[2] Hadoop. Web page.
[3] HadoopDB. Web page.
[4] Hive. Web page.
[5] Parallel database. Web page.
[6] Apache Hadoop. Web page.
[7] PostgreSQL. Web page.
[8] C. Mullins. The DBA corner. Web page, July
[9] MapReduce. Web page.
BIOGRAPHICAL SKETCH

Vanessa Cedeno

In the spring of 2007, Vanessa Cedeno completed her Bachelor's degree in Computer Engineering at ESPOL, Escuela Superior Politecnica del Litoral, in Guayaquil, Ecuador. She enrolled in the master's program at Florida State University in the fall of Vanessa's research interests include database systems, distributed systems, and the analysis of algorithms.
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationComparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationHadoop Distributed Filesystem. Spring 2015, X. Zhang Fordham Univ.
Hadoop Distributed Filesystem Spring 2015, X. Zhang Fordham Univ. MapReduce Programming Model Split Shuffle Input: a set of [key,value] pairs intermediate [key,value] pairs [k1,v11,v12, ] [k2,v21,v22,
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationPARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system
DOI 10.1007/s11280-014-0312-2 PARADISE: Big data analytics using the DBMS tightly integrated with the distributed file system Jun-Sung Kim Kyu-Young Whang Hyuk-Yoon Kwon Il-Yeol Song Received: 10 June
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationEntering the Zettabyte Age Jeffrey Krone
Entering the Zettabyte Age Jeffrey Krone 1 Kilobyte 1,000 bits/byte. 1 megabyte 1,000,000 1 gigabyte 1,000,000,000 1 terabyte 1,000,000,000,000 1 petabyte 1,000,000,000,000,000 1 exabyte 1,000,000,000,000,000,000
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationHadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
More informationHadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps
Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationDistributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
More informationWhite Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
More informationHadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationAnalysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationSpring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE
Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationHDFS Users Guide. Table of contents
Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationTutorial for Assignment 2.0
Tutorial for Assignment 2.0 Florian Klien & Christian Körner IMPORTANT The presented information has been tested on the following operating systems Mac OS X 10.6 Ubuntu Linux The installation on Windows
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationBig Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationBig Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationMapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
More informationAlternatives to HIVE SQL in Hadoop File Structure
Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationSriram Krishnan, Ph.D. sriram@sdsc.edu
Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationMapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
More informationGraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
More information1. GridGain In-Memory Accelerator For Hadoop. 2. Hadoop Installation. 2.1 Hadoop 1.x Installation
1. GridGain In-Memory Accelerator For Hadoop GridGain's In-Memory Accelerator For Hadoop edition is based on the industry's first high-performance dual-mode in-memory file system that is 100% compatible
More informationFault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
More informationHadoop and its Usage at Facebook. Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur dhruba@apache.org, June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
More informationHadoop Training Hands On Exercise
Hadoop Training Hands On Exercise 1. Getting started: Step 1: Download and Install the Vmware player - Download the VMware- player- 5.0.1-894247.zip and unzip it on your windows machine - Click the exe
More informationMySQL and Hadoop. Percona Live 2014 Chris Schneider
MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for
More information