Distributed Systems + Middleware Hadoop
- Percival Hunter
- 8 years ago
Transcription
1 Distributed Systems + Middleware Hadoop Alessandro Sivieri Dipartimento di Elettronica, Informazione e Bioingegneria Politecnico, Italy alessandro.sivieri@polimi.it
2 Contents Introduction to Hadoop History MapReduce HDFS Pig Distributed Systems + Middleware: Hadoop 2
3 INTRODUCTION TO HADOOP
4 Data
- We live in a digital world that produces data at an impressive speed
  - As of 2012, 2.7 ZB of data exist (1 ZB = 10^21 bytes)
  - NYSE produces 1 TB of data per day
  - The Internet Archive grows by 20 TB per month
  - The LHC produces 15 PB of data per year
  - AT&T has a 300 TB database
  - 100 TB of data are uploaded daily to Facebook
5 Data
- Personal data is growing, too
  - E.g., photos: a single photo taken with a commercial Nikon camera takes about 6 MB (default settings); a year of family photos takes about 8 GB of space
  - This adds to the personal content uploaded to social networks, video websites, blogs and more
- Machine-produced data is also growing
  - Machine logs
  - Sensor networks and monitored data
6 Data analysis
- Main problem: the ratio of disk read speed to disk capacity has not really improved
- Solution: parallelize the storage and read less data from each disk
- Problems: hardware replication, data aggregation
- Take, for example, an RDBMS (keeping in mind that disk seek time determines the latency of operations):
  - Updating records is fast: a B-Tree structure is efficient
  - Reading many records is slow: if access is dominated by seek time, it is faster to stream the entire disk (which operates at the transfer rate)
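The seek-versus-transfer argument can be checked with a quick calculation. The sketch below is only illustrative: the 10 ms seek time and 100 MB/s transfer rate are assumed figures, not values from the slides, and the class name SeekVsScan is hypothetical.

```java
// Back-of-the-envelope comparison of random access vs. a sequential scan.
// All hardware figures are assumptions chosen for illustration.
final class SeekVsScan {
    static final double SEEK_SECONDS = 0.010;            // assumed: 10 ms per seek
    static final double TRANSFER_BYTES_PER_SEC = 100e6;  // assumed: 100 MB/s

    /** Time to fetch records one by one, paying one seek per record. */
    static double randomAccessSeconds(long records, long recordBytes) {
        return records * (SEEK_SECONDS + recordBytes / TRANSFER_BYTES_PER_SEC);
    }

    /** Time to stream the whole dataset sequentially after a single seek. */
    static double fullScanSeconds(long totalBytes) {
        return SEEK_SECONDS + totalBytes / TRANSFER_BYTES_PER_SEC;
    }

    public static void main(String[] args) {
        long totalBytes = 1_000_000_000_000L;  // 1 TB dataset
        long recordBytes = 100;                // 100-byte records
        long touched = (totalBytes / recordBytes) / 100;  // read 1% of the records
        System.out.printf("random access: %.0f s%n", randomAccessSeconds(touched, recordBytes));
        System.out.printf("full scan:     %.0f s%n", fullScanSeconds(totalBytes));
    }
}
```

With these assumed figures, fetching 1% of the records one at a time takes roughly a million seconds, while streaming the whole terabyte takes about ten thousand seconds.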
7 Hadoop
- Reliable data storage: Hadoop Distributed File System
- Data analysis: MapReduce implementation
- Many tools for developers
  - Easy cluster administration
  - Query languages, some similar to SQL
  - Column-oriented distributed databases on top of Hadoop
  - Structured to unstructured repositories and back
8 Hadoop vs. The (existing) World
- RDBMS:
  - Disk seek time
  - Some types of data are not normalized (e.g., logs): MapReduce works well with unstructured data
  - MapReduce scales linearly (while an RDBMS does not)
- Volunteer computing (e.g., SETI@home):
  - Similar model, but Hadoop works in a localized cluster sharing high-performance bandwidth, while volunteer computing works over the Internet on untrusted computers that are performing other operations in the meantime
9 Hadoop vs. The (existing) World
- MPI:
  - Works well for compute-intensive jobs, but the network becomes the bottleneck when hundreds of GB of data have to be analyzed
  - Conversely, MapReduce does its best to exploit data locality by collocating the data with the compute node (network bandwidth is the most precious resource and must not be wasted)
  - MapReduce operates at a higher level than MPI: the data flow is already taken care of
  - MapReduce implements failure recovery (in MPI the developer has to handle checkpoints and failure recovery)
10 HADOOP HISTORY
11 Brief history
- In 2002, Mike Cafarella and Doug Cutting started working on Apache Nutch, a new Web search engine
- In 2003, Google published a paper on the Google File System, a distributed filesystem, and Mike and Doug started working on a similar open source project
- In 2004, Google published another paper, on the MapReduce computation model, and once again Mike and Doug implemented an open source version in Nutch
12 Brief history
- In 2006, these two projects separated from Nutch and became Hadoop
- In the same year, Doug Cutting joined Yahoo! and started using Hadoop there
- In 2008, Hadoop was used by Yahoo! (10000-core cluster), Last.fm, Facebook and the NYT
- In 2009, Yahoo! broke the world record by sorting 1 TB of data in 62 seconds using Hadoop
- Since then, Hadoop has become mainstream in industry
13 Examples from the Real World
- Last.fm
  - Each user listening to a song (locally or via streaming) generates a trace
  - Hadoop analyses these traces to produce charts
  - E.g., track statistics per user and per country, weekly top tracks
- Facebook
  - Daily and hourly summaries over user logs
  - Product usage, ad campaigns
  - Ad-hoc jobs over historical data
  - Long-term archival store
  - Integrity checks
14 Examples from the Real World
- Nutch search engine
  - Link inversion: find outgoing links that point to a specific Web page
  - URL fetching
  - Produce Lucene indexes (for text searches)
- Infochimps: explore network graphs
  - Social networks: Twitter analysis, measure communities
  - Biology: neuron connections in roundworms
  - Street connections: OpenStreetMap
15 Hadoop umbrella
- HDFS: distributed filesystem
- MapReduce: distributed data processing model
- MRUnit: unit testing of MapReduce applications
- Pig: data flow language to explore large datasets
- Hive: distributed data warehouse
- HBase: distributed, column-oriented database
- ZooKeeper: distributed coordination service
- Sqoop: efficient bulk transfers of data between relational databases and HDFS
16 MAPREDUCE
17 MapReduce
- Model for analyzing large amounts of data
- Data has to be organized as a key-value dataset
- Two phases:
  - map(k1, v1) -> list(k2, v2), where the input domain is different from the output domain
  - (shuffle: intermediate phase that sorts the output of map and groups it by key)
  - reduce(k2, list(v2)) -> list(v3), where the input and output domains are the same
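The three phases can be imitated in a single process to see how the data moves. This is a minimal sketch in plain Java, not Hadoop's API: the class MiniMapReduce and its methods are hypothetical, the mapper ignores the input key k1, and the reducer returns one value per key instead of a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Single-process illustration of map -> shuffle -> reduce; not Hadoop's API.
final class MiniMapReduce {
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map + shuffle: group the intermediate values by key, keeping keys sorted.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I item : input) {
            for (Map.Entry<K, V> pair : mapper.apply(item)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // Reduce: one call per distinct key.
        Map<K, R> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Example job: maximum temperature per day from "day,temperature" lines. */
    static Map<String, Double> maxPerDay(List<String> lines) {
        return MiniMapReduce.<String, String, Double, Double>run(
            lines,
            line -> {
                String[] parts = line.split(",");
                return List.of(Map.entry(parts[0], Double.valueOf(parts[1])));
            },
            (day, temps) -> {
                double max = Double.NEGATIVE_INFINITY;
                for (double t : temps) max = Math.max(max, t);
                return max;
            });
    }
}
```

The explicit type arguments on the call to run are there only to keep Java's lambda type inference unambiguous.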
18 MapReduce on Hadoop
- Job: unit of work to be performed by the system
  - Input data
  - Map and Reduce implementations
  - Configuration
- Map task and Reduce task: smaller pieces of a job
- Jobtracker: node of the cluster coordinating the job
- Tasktracker: runs a task and reports to the jobtracker
- Split: part of the input
19 MapReduce on Hadoop
- The jobtracker splits the input into parts, and a map task is run for each split
  - Splits are processed in parallel across the nodes
- Hadoop tries hard to run the map task for a given split on the same node where HDFS has stored that split
  - If that is not possible, at least in the same rack
- Map task output is written to local disk (not to HDFS: intermediate data does not need replication)
- Reduce tasks receive map output over the network (no data locality here)
- Final output is saved on HDFS
20 MapReduce on Hadoop
21 MapReduce on Hadoop (with multiple reduce tasks)
22 MapReduce on Hadoop (no reduce)
23 Intermediate data
- Combiner function
  - Aggregates data from several map outputs
  - To minimize network transfer
  - Hadoop decides whether it is needed; it may not be executed at all
  - In a way, it can be seen as a local instance of the reduce function
- Shuffle
  - Sorts each map output
  - Transfers it to (one of) the reducers
  - Merges it with other map outputs, maintaining the sort order
- Everything can be configured
  - Memory, buffer sizes, parallel copies: check the book!
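Because the combiner may run zero or more times, the result must not depend on it. For a maximum, the reduce function can double as the combiner; for an average it cannot, since a mean of means is generally wrong. A common workaround is to carry (sum, count) partials. The sketch below shows the idea in plain Java; MeanCombiner and Partial are hypothetical names, not Hadoop classes.

```java
import java.util.List;

// Why a mean needs (sum, count) partials when a combiner may or may not run.
final class MeanCombiner {
    /** Partial aggregate that can be merged in any grouping. */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double mean() {
            return sum / count;
        }
    }

    /** Combiner step: collapse the values of one map output into one partial. */
    static Partial combine(List<Double> values) {
        Partial acc = new Partial(0.0, 0);
        for (double v : values) {
            acc = acc.merge(new Partial(v, 1));
        }
        return acc;
    }

    /** Reduce step: merge the partials, however many the combiner produced. */
    static double reduce(List<Partial> partials) {
        Partial acc = new Partial(0.0, 0);
        for (Partial p : partials) {
            acc = acc.merge(p);
        }
        return acc.mean();
    }
}
```

For two map outputs [10, 20] and [30], averaging the per-output means would give 22.5, while merging the partials gives the correct mean of 20, with or without the combine step.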
24 MapReduce on Hadoop
- Everything is configurable
  - Number of map tasks
  - Number of reduce tasks
  - Data compression
  - Custom serialization
  - Memory management
- Profilers to understand performance in detail
- Tools to stream data through subsequent MapReduce runs, called workflows (Apache Oozie)
  - For complex problems
25 MapReduce on Hadoop: versions
- Version 1: first developed version of Hadoop MR
  - The runtime contains the previously mentioned jobtracker and tasktrackers
  - Still developed (latest version: 1.2)
- Version 2: the new Hadoop MR
  - The runtime, called YARN (Yet Another Resource Negotiator), replaces (and generalizes) the previous trackers
  - New HDFS features
  - Different configurations and APIs
- The book mixes the two versions here and there
- We will follow version 2 (=> Hadoop 2.2.0, the latest version)
26 MapReduce example
- Main interface is written in Java
  - Hadoop is written in Java
- There are interfaces in many other languages
  - Hadoop is able to work in streaming mode, using Unix pipes
  - Other languages take advantage of this
    - Python
    - Ruby
    - C++
27 MapReduce example
- A simple dataset
  - Temperature and humidity measured through WSN motes in two rooms
  - Columns: Timestamp, Room, Temperature, Humidity, Battery
28 MapReduce example
- Dataset
  - Sampling time: 5 minutes
  - Total dataset: about 566 days (more than 1.5 years)
  - CSV (only a few megabytes: it doesn't really exploit Hadoop, but it will help us understand how it works)
- Calculations
  - Max temperature per day
  - Mean temperature per day
29 MapReduce example
- Classes to be implemented
  - Mapper
  - Reducer
  - Job handler (the main file, launching the execution and waiting for results); this will handle the configuration, too
- Configuration
  - Default filesystem (HDFS, but Hadoop can read from standard filesystems)
  - Default configuration for node and data managers
30 Hadoop configurations
- Standalone (default configuration)
  - Runs on a single JVM
  - No parallelization
  - Useful for debugging purposes
- Pseudo-distributed (single-host cluster)
  - Parallelization
  - Runs daemons on a single host
- Cluster
  - Full-scale Hadoop
31 Hadoop configuration
- We will see standalone and pseudo-distributed
  - I will include the cluster configuration in the exercises, but it requires a certain amount of memory to run
- For standalone, you need to do almost nothing
  - Just set the JAVA_HOME environment variable in hadoop-env.sh before executing hadoop itself
- Notice that Windows is NOT supported in production mode for Hadoop
  - Examples have been tested on Linux and OS X
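32 Mapper
    import java.io.IOException;
    import java.text.DateFormat;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MeanTemperatureMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] parts = line.split("\\s+");
            // Skip lines that do not carry timestamp, mote and temperature.
            if (parts.length < 3) {
                return;
            }
            long timestamp = Long.parseLong(parts[0]) * 1000;  // seconds -> milliseconds
            int mote = Integer.parseInt(parts[1]);
            if (mote != 18) {
                return;
            }
            int temperatureInt = Integer.parseInt(parts[2]);
            double temperature = temperatureInt / 100.0;
            Date date = new Date(timestamp);
            DateFormat df = new SimpleDateFormat("yyyyMMdd");
            String dateKey = df.format(date);
            // Emit (day, temperature) so the reducer sees all readings of a day together.
            context.write(new Text(dateKey), new DoubleWritable(temperature));
        }
    }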
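The per-line logic of the mapper can be pulled out into a plain function, which makes it easy to check without a cluster. The sketch below is one possible way to do that, not part of the course code: the class TemperatureLineParser is hypothetical, and unlike the mapper above it pins the time zone to UTC (an assumption) and silently skips malformed numbers.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;
import java.util.Optional;
import java.util.TimeZone;

// The mapper's per-line logic as a plain function, testable without Hadoop.
final class TemperatureLineParser {
    static final int MOTE = 18;  // the mote selected on the Mapper slide

    /** Returns (yyyyMMdd, temperature), or empty if the line is skipped. */
    static Optional<Map.Entry<String, Double>> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 3) {
            return Optional.empty();
        }
        try {
            long millis = Long.parseLong(parts[0]) * 1000L;  // seconds -> milliseconds
            if (Integer.parseInt(parts[1]) != MOTE) {
                return Optional.empty();
            }
            double temperature = Integer.parseInt(parts[2]) / 100.0;
            SimpleDateFormat df = new SimpleDateFormat("yyyyMMdd");
            df.setTimeZone(TimeZone.getTimeZone("UTC"));  // assumption: bucket days in UTC
            String day = df.format(new Date(millis));
            Map.Entry<String, Double> pair = Map.entry(day, Double.valueOf(temperature));
            return Optional.of(pair);
        } catch (NumberFormatException e) {
            return Optional.empty();  // malformed line: skip it rather than fail
        }
    }
}
```

Fixing the time zone matters because SimpleDateFormat otherwise uses the default zone of the machine it runs on, so the same timestamp could fall on different days on different nodes.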
33 Mapper
- Input columns: Timestamp, Room, Temperature, Humidity, Battery
- Output columns: Day, Temperature
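34 Reducer
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MeanTemperatureReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            int counter = 0;
            // Accumulate all the temperatures recorded for this day.
            for (DoubleWritable dw : values) {
                sum += dw.get();
                ++counter;
            }
            context.write(key, new DoubleWritable(sum / counter));
        }
    }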
35 Reducer
- Day, Temperature -> Mean temperature
36 Job handler
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MeanTemperatureJob extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
                ToolRunner.printGenericCommandUsage(System.err);
                return -1;
            }
            Job job = Job.getInstance(getConf(), "Mean temperature");
            job.setJarByClass(getClass());
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(MeanTemperatureMapper.class);
            job.setReducerClass(MeanTemperatureReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            // Block until the job finishes; map success/failure to the exit code.
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            int exitCode = ToolRunner.run(new MeanTemperatureJob(), args);
            System.exit(exitCode);
        }
    }
37 Run MapReduce
- Java classes are compiled through Maven
  - The example will contain a sample pom.xml file
- To run the example:
    mvn package
    export HADOOP_CLASSPATH=somename.jar
    hadoop MeanTemperatureJob -fs file:/// input/file.csv output
- The output directory MUST NOT exist
- Notice the -fs option
38 HDFS
39 HDFS
- The previous example used the OS filesystem
  - Hadoop does not have any problem with that
  - But, if we want to move to pseudo-distributed and cluster modes, we need to start using HDFS
- The Hadoop Distributed File System is designed to store very large files and is suited to streaming data access
40 Name and data nodes
- HDFS has a single* namenode (master) and several datanodes (slaves)
- The namenode manages the filesystem namespace (i.e., metadata)
  - It maintains a namespace image and an edit log
  - It has a list of all the datanodes and which blocks they store
  - It does not store any file content itself
- Backup namenodes!
* From version 2, Hadoop added the concept of HDFS federation
41 File distribution
- A datanode is a machine storing blocks of files, continuously communicating with the namenode
  - Without a namenode, a datanode is useless
- Each file uploaded to HDFS is split into blocks of 64 MB each
  - Minimize seek, maximize transfer rate
  - If a file (or the last block of a file) is smaller than 64 MB, it does not occupy 64 MB
    - This is different from OS filesystems
  - A file can be bigger than any single disk in the network
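The block arithmetic is simple enough to write down. The following sketch assumes that 64 MB means 64 * 1024 * 1024 bytes; BlockMath is a hypothetical helper, not an HDFS class.

```java
// Block arithmetic for a file stored in fixed-size blocks (64 MB on the slide).
final class BlockMath {
    static final long BLOCK_BYTES = 64L * 1024 * 1024;  // assumption: binary megabytes

    /** Number of blocks needed for a file of the given size. */
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_BYTES - 1) / BLOCK_BYTES;  // ceiling division
    }

    /** Bytes actually occupied by the last block (0 for an empty file). */
    static long lastBlockBytes(long fileBytes) {
        if (fileBytes == 0) {
            return 0;
        }
        long remainder = fileBytes % BLOCK_BYTES;
        return remainder == 0 ? BLOCK_BYTES : remainder;
    }
}
```

For example, a 200 MB file needs four blocks, and the last one occupies only 8 MB rather than a full 64 MB.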
42 File distribution
43 Interactions
- Command-line interface
  - ls, mkdir, copyFromLocal, cat
- Third-party tools
  - FUSE
- Java APIs
44 Writable datatypes
- Mappers and Reducers use particular types
  - Text, DoubleWritable, LongWritable
- Hadoop defines specific types wrapping Java types, optimizing network serialization
- To add new types, you can implement Writable and WritableComparable
45 MapReduce example (reprise)
- Let's assume we want to output the mean temperature per day and per mote
- We need to use the pair (day, mote) as the key in our key-value pairs
- Create a new class that implements WritableComparable
  - Comparable is needed because Hadoop sorts keys before reducing
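46 MapReduce example (reprise)
    public class RoomDayWritable implements WritableComparable<RoomDayWritable> {
        private Text date;
        private IntWritable mote;

        // Hadoop instantiates keys reflectively, so a no-argument constructor is needed.
        public RoomDayWritable() {
            this.date = new Text();
            this.mote = new IntWritable();
        }

        public RoomDayWritable(String date, int mote) {
            this.date = new Text(date);
            this.mote = new IntWritable(mote);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            this.mote.write(out);
            this.date.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order in which they were written.
            this.mote.readFields(in);
            this.date.readFields(in);
        }

        // The bodies of the following methods are not shown on the slide.
        @Override
        public int hashCode() { /* not shown */ }

        @Override
        public boolean equals(Object obj) { /* not shown */ }

        @Override
        public int compareTo(RoomDayWritable other) { /* not shown */ }
    }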
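The contract behind write and readFields, together with equals, hashCode and compareTo, can be tried out with java.io alone. The class below is a hypothetical stand-in (DayMoteKey), not the course's RoomDayWritable and not a Hadoop type; the choice to order keys by date first and then by mote is an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Stand-in for a composite key; mirrors the write/readFields contract using java.io only.
final class DayMoteKey implements Comparable<DayMoteKey> {
    private String date = "";
    private int mote;

    DayMoteKey() {}  // no-argument constructor, used when deserializing

    DayMoteKey(String date, int mote) {
        this.date = date;
        this.mote = mote;
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(mote);
        out.writeUTF(date);
    }

    void readFields(DataInput in) throws IOException {
        mote = in.readInt();   // same order as in write()
        date = in.readUTF();
    }

    @Override
    public int compareTo(DayMoteKey other) {
        int byDate = date.compareTo(other.date);  // assumed order: date first, then mote
        return byDate != 0 ? byDate : Integer.compare(mote, other.mote);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof DayMoteKey)) {
            return false;
        }
        DayMoteKey other = (DayMoteKey) obj;
        return date.equals(other.date) && mote == other.mote;
    }

    @Override
    public int hashCode() {
        return Objects.hash(date, mote);
    }

    /** Serializes and deserializes a key, as a framework would between map and reduce. */
    static DayMoteKey roundTrip(DayMoteKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            DayMoteKey copy = new DayMoteKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping equals, hashCode and compareTo consistent with each other matters because keys are both sorted and, when several reducers are used, typically partitioned by hash.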
47 MapReduce example (reprise)
- Day, Mote, Temperature -> Day, Mote, Mean temperature
48 Single node
- Configuration needs to be changed
  - Location of all the daemons is localhost
  - Number of replicas is 1
- Format HDFS
- Start the HDFS daemon
- Start the YARN daemon
- Run the demo
49 Configuration
core-site.xml
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
mapred-site.xml
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
yarn-site.xml
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
    </configuration>
50 Run the example
- Format HDFS
    hdfs namenode -format
- Start daemons (check the logs!)
    start-dfs.sh
    start-yarn.sh
    mr-jobhistory-daemon.sh start historyserver (*)
    jps
- Copy the file to HDFS
    hadoop fs -mkdir -p /user/hadoop
    hadoop fs -copyFromLocal source.csv .
51 Run the example
- Run
    hadoop jar somename.jar MeanTemperatureJob source.csv output
  - Again, the output directory (on HDFS!) MUST NOT exist
- Check the output
    hadoop fs -cat output/part-r
52 Cluster mode
- Configuration needs to be changed
  - See the example file on the course website
  - A new file, called slaves, has to be created on the master node, containing the addresses of all the datanodes
- Running the example works in exactly the same way
53 PIG
54 Pig
- Language and runtime to perform complex queries on richer data structures
  - Pig Latin is the language to express data flows
  - Runtime to execute scripts on Hadoop clusters
- A script is a sequence of transformations on initial data
  - They will be translated into MapReduce jobs
- Faster development: an analysis can be written in a few lines and executed on terabytes of data
- User Defined Functions can extend Pig capabilities
- Performance is comparable with native MR code*
55 Pig
- Three execution modes
  - Script
  - Interactive shell (Grunt)
  - Embedded in Java
- IDE plugins
- Scripts can be run in local mode (single JVM) or on a cluster (pseudo or real)
- Pig is able to generate a (reasonably) complete and concise sample dataset for a script
56 MapReduce example with Pig
    REGISTER ./hadooptests-1.0.jar;
    raw = LOAD 'temperature-sorted.csv' USING PigStorage('\t') AS (timestamp:long, mote:int, temperature:int, humidity:int, battery:int);
    clean = FILTER raw BY mote != 1;
    day = FOREACH clean GENERATE me.sivieri.hadoop.pig.ExtractDate(timestamp) AS date, mote, temperature / as temperature;
    grouped_date_mote = GROUP day BY (date, mote);
    mean_temp = FOREACH grouped_date_mote GENERATE group, AVG(day.temperature);
    DUMP mean_temp;
57 MapReduce example with Pig
    public class ExtractDate extends EvalFunc<String> {
        @Override
        public String exec(Tuple arg0) throws IOException {
            // An empty tuple carries no timestamp to convert.
            if (arg0 == null || arg0.size() == 0)
                return null;
            try {
                Long timestamp = (Long) arg0.get(0);
                Date date = new Date(timestamp);
                DateFormat df = new SimpleDateFormat("yyyyMMdd");
                return df.format(date);
            } catch (Exception e) {
                System.err.println("ExtractDate: failed to process input; error - " + e.getMessage());
                return null;
            }
        }
    }
58 Bibliography
- Tom White, Hadoop: The Definitive Guide, 3rd Edition, O'Reilly
- Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, The Google File System, Google, 2003
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, 2004