Hadoop and Big Data. Keijo Heljanko. Department of Information and Computer Science School of Science Aalto University


1 Keijo Heljanko Department of Information and Computer Science School of Science Aalto University 1/77

2 Business Drivers of Cloud Computing Large data centers allow for economies of scale Cheaper hardware purchases Cheaper cooling of hardware Example: Google paid 40 MEur for a Summa paper mill site in Hamina, Finland: Data center cooled with sea water from the Baltic Sea Cheaper electricity Cheaper network capacity Fewer administrators per computer Unreliable commodity hardware is used Reliability is obtained by replication of hardware components combined with a fault-tolerant software stack 2/77

3 Cloud Computing Technologies A collection of technologies aimed at providing elastic, pay-as-you-go computing Virtualization of computing resources: Amazon EC2, Eucalyptus, OpenNebula, OpenStack Compute,... Scalable file storage: Amazon S3, GFS, HDFS,... Scalable batch processing: Google MapReduce, Apache Hadoop, PACT, Microsoft Dryad, Google Pregel, Berkeley Spark,... Scalable datastores: Amazon Dynamo, Apache Cassandra, Google Bigtable, HBase,... Distributed consensus: Google Chubby, Apache Zookeeper,... Scalable Web application hosting: Google App Engine, Microsoft Azure, Heroku,... 3/77

4 Clock Speed of Processors Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005 (updated graph in August 2009). 4/77

5 Implications of the End of Free Lunch The clock speeds of microprocessors are not going to improve much in the foreseeable future The efficiency gains in single-threaded performance are going to be only moderate The number of transistors in a microprocessor is still growing at a high rate One of the main uses of these transistors has been to increase the number of computing cores a processor has The number of cores in a low-end workstation (such as those employed in large-scale datacenters) is going to keep on growing steadily Programming models need to change to efficiently exploit all the available concurrency - scalability to a high number of cores/processors will need to be a major focus 5/77

6 Warehouse-scale Computing (WSC) The smallest unit of computation at Google scale is a warehouse full of computers [WSC]: Luiz André Barroso, Urs Hölzle: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers The WSC book says:... we must treat the datacenter itself as one massive warehouse-scale computer (WSC). 6/77

7 Jeffrey Dean (Google): LADIS 2009 keynote failure numbers LADIS 2009 keynote: Designs, Lessons and Advice from Building Large Distributed Systems ladis2009/talks/dean-keynote-ladis2009.pdf At warehouse scale, failures will be the norm and thus fault-tolerant software will be inevitable Hypothetically assume that the server mean time between failures (MTBF) would be 30 years (a highly optimistic and not realistic number!). In a WSC with on the order of 10,000 nodes, roughly one node will fail per day. Typical yearly flakiness metrics from J. Dean (Google, 2009): 1-5% of your disk drives will die Servers will crash at least twice (2-4% failure rate) 7/77
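A rough back-of-envelope check of the failure arithmetic above (the 10,000-node cluster size is an assumption, as the exact figure is missing from the slide text): an MTBF of 30 years is about 10,950 days, so 10,000 servers give an expected 10,000 / 10,950 ≈ 0.9 node failures per day, i.e., roughly one failure per day.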

8 Big Data As of May 2009, the amount of digital content in the world is estimated to be 500 Exabytes (500 million TB) An EMC-sponsored study by IDC in 2007 estimates the amount of information created in 2010 to be 988 EB Worldwide estimated hard disk sales in 2010: 675 million units Data comes from: video, digital images, sensor data, biological data, Internet sites, social media,... The problem of such large data masses, termed Big Data, calls for new approaches to storage and processing of data 8/77

9 Example: Simple Word Search Example: Suppose you need to search 2 TB of text for a word that occurs only once in the text mass, using a compute cluster Assuming a 100 MB/s read speed, in the worst case reading all data from a single 2 TB disk takes 5.5 hours If 100 hard disks can be used in parallel, the same task takes less than four minutes Scaling using hard disk parallelism is one of the design goals of scalable batch processing in the cloud 9/77
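The timings follow directly from the assumed read speed: 2 TB / (100 MB/s) = 20,000 s ≈ 5.5 hours on one disk; with 100 disks read in parallel, 20,000 s / 100 = 200 s ≈ 3.3 minutes.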

10 Google MapReduce A scalable batch processing framework developed at Google for computing the Web index When dealing with Big Data (a substantial portion of the Internet in the case of Google!), the only viable option is to use hard disks in parallel to store and process it Some of the challenges for storage come from Web services that store and distribute pictures and videos We need a system that can effectively utilize hard disk parallelism and hide hard disk and other component failures from the programmer 10/77

11 Google MapReduce (cnt.) MapReduce is tailored for batch processing with hundreds to thousands of machines running in parallel; typical job runtimes are from minutes to hours As an added bonus we would like to get increased programmer productivity compared to each programmer developing their own tools for utilizing hard disk parallelism 11/77

12 Google MapReduce (cnt.) The MapReduce framework takes care of all issues related to parallelization, synchronization, load balancing, and fault tolerance. All these details are hidden from the programmer The system needs to be linearly scalable to thousands of nodes working in parallel. The only way to do this is to have a very restricted programming model where the communication between nodes happens in a carefully controlled fashion Apache Hadoop is an open source MapReduce implementation used by Yahoo!, Facebook, and Twitter 12/77

13 MapReduce and Functional Programming Based on functional programming in the large: The user is only allowed to write side-effect free functions Map and Reduce Re-execution is used for fault tolerance. If a node executing a Map or a Reduce task fails to produce a result due to hardware failure, the task will be re-executed on another node Side effects in functions would make this impossible, as one could not re-create the environment in which the original task executed One just needs fault-tolerant storage of task inputs The functions themselves are typically written in a standard imperative programming language, usually Java 13/77

14 Why No Side-Effects? Side-effect free programs will produce the same output regardless of the number of computing nodes used by MapReduce Running the code on one machine for debugging purposes produces the same results as running the same code in parallel It is easy to introduce side-effects into MapReduce programs as the framework does not enforce a strict programming methodology. However, the behavior of such programs is undefined by the framework, and they should therefore be avoided (see the sketch below). 14/77
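As an illustration, consider a Mapper that keeps mutable state in a static field (a made-up sketch; the class and field names are invented, and the imports are the same Hadoop classes as in the WordCount example later in these slides). The final value of the static counter depends on how many map tasks happened to run in the same JVM and on whether any task was re-executed, so the framework gives no guarantees about it:

// Hypothetical illustration of a side effect in a Mapper
public static class LeakyMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  // Side effect: mutable per-JVM state, not reproducible across
  // re-executions or different cluster sizes
  private static long seenLines = 0;

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    seenLines++; // undefined behavior from MapReduce's point of view
    context.write(new Text("lines"), new IntWritable(1));
  }
}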

15 Yahoo! MapReduce Tutorial We use figures from the excellent Yahoo! MapReduce tutorial [YDN-MR], available from: module4.html In functional programming, two list processing concepts are used: Mapping a list with a function Reducing a list with a function 15/77

16 Mapping a List Mapping a list applies the mapping function to each list element (in parallel) and outputs the list of mapped list elements: Figure: Mapping a List with a Map Function, Figure 4.1 from [YDN-MR] 16/77

17 Reducing a List Reducing a list iterates over a list sequentially and produces an output created by the reduce function: Figure: Reducing a List with a Reduce Function, Figure 4.2 from [YDN-MR] 17/77
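These two concepts can be illustrated with plain Java collections, independently of the Hadoop API (a minimal sketch; Java 8 streams are assumed here, even though they postdate these slides):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ListMapReduceDemo {
  public static void main(String[] args) {
    List<String> words = Arrays.asList("Hello", "World", "Bye", "World");

    // Mapping: apply a function to every element independently
    List<Integer> lengths =
        words.stream().map(String::length).collect(Collectors.toList());

    // Reducing: combine the mapped elements into a single result
    int totalLength = lengths.stream().reduce(0, Integer::sum);

    System.out.println(lengths + " -> " + totalLength); // [5, 5, 3, 5] -> 18
  }
}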

18 Grouping Map Output by Key to Reducer In MapReduce the map function outputs (key, value) pairs. The MapReduce framework groups map outputs by key, and gives each reduce function instance a (key, (..., list of values, ...)) pair as input. Note: Each list of values having the same key will be independently processed: Figure: Keys Divide Map Output to Reducers, Figure 4.3 from [YDN-MR] 18/77

19 MapReduce Data Flow Practical MapReduce systems split input data into large (64MB+) blocks fed to user defined map functions: Figure: High Level MapReduce Dataflow, Figure 4.4 from [YDN-MR] 19/77

20 Recap: Map and Reduce Functions The framework only allows a user to write two functions: a Map function and a Reduce function The Map function is fed blocks of data (block size 64 MB or more), and it produces (key, value) pairs The framework groups all values with the same key into a (key, (..., list of values, ...)) format, and these are then fed to the Reduce function A special Master node takes care of the scheduling and fault tolerance by re-executing Mappers or Reducers 20/77
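Written as type signatures in the style of the original MapReduce paper, the two user-supplied functions are:

map:    (k1, v1)       -> list(k2, v2)
reduce: (k2, list(v2)) -> list(v2)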

21 MapReduce Diagram Figure: J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004. The figure shows the user program forking a Master plus Map and Reduce workers; Map workers read the input splits and write intermediate files to their local disks, Reduce workers read these remotely in the Reduce phase and write the output files. 21/77

22 Example: Word Count Classic word count example from the Hadoop MapReduce tutorial: current/mapred_tutorial.html Consider doing a word count of the following file using MapReduce: Hello World Bye World Hello Hadoop Goodbye Hadoop 22/77

23 Example: Word Count (cnt.) Consider a Map function that reads in words one at a time, and outputs (word, 1) for each parsed input word The Map function output is: (Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1) 23/77

24 Example: Word Count (cnt.) The Shuffle phase between Map and Reduce phase creates a list of values associated with each key The Reduce function input is: (Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1)) 24/77

25 Example: Word Count (cnt.) Consider a reduce function that sums the numbers in the list for each key and outputs (word, count) pairs. The output of the Reduce function is the output of the MapReduce job: (Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2) 25/77

26 Phases of MapReduce 1. A Master (in Hadoop terminology: Job Tracker) is started that coordinates the execution of a MapReduce job. Note: The Master is a single point of failure 2. The Master creates a predefined number M of Map workers, and assigns each one an input split to work on. It also later starts a predefined number R of Reduce workers 3. Input is assigned to a free Map worker one split (64 MB or more) at a time, and each user-defined Map function is fed (key, value) pairs as input and also produces (key, value) pairs 26/77

27 Phases of MapReduce (cnt.) 4. Periodically the Map workers flush their (key, value) pairs to the local hard disks, partitioning them by key into R partitions (default: use hashing, see the sketch below), one per Reduce worker 5. When all the input splits have been processed, a Shuffle phase starts where M x R file transfers are used to send all of the mapper outputs to the reducer handling each key partition. After a reducer receives its input files, it sorts (and groups) the (key, value) pairs by key 6. User-defined Reduce functions iterate over the (key, (..., list of values, ...)) lists, generating output files of (key, value) pairs, one per reducer 27/77
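The default key-to-partition mapping is essentially a hash of the key modulo the number of reducers (a sketch of the idea; Hadoop implements it in its HashPartitioner class):

// Sketch of default hash partitioning: the same key always goes to the
// same reducer, and keys spread roughly evenly over the R partitions
public class HashPartitionSketch {
  public static int getPartition(Object key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}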

28 Google MapReduce (cnt.) The user just supplies the Map and Reduce functions, nothing more The only means of communication between nodes is through the shuffle from a mapper to a reducer The framework can be used to implement a distributed sorting algorithm by using a custom partitioning function (see the sketch below) The framework does automatic parallelization and fault tolerance by using a centralized Job Tracker (Master) and a distributed filesystem that stores all data redundantly on compute nodes Uses the functional programming paradigm to guarantee correctness of parallelization and to implement fault tolerance by re-execution 28/77
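For example, distributed sorting can be obtained by replacing the hash partitioner with a range partitioner, so that reducer i receives only keys smaller than those of reducer i+1 and the concatenated reducer outputs are globally sorted. A minimal sketch (the split points are made-up and would in practice be sampled from the data; the Hadoop imports are as in the WordCount example below):

// Hypothetical range partitioner for sorting Text keys
public static class RangePartitioner
    extends org.apache.hadoop.mapreduce.Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String s = key.toString();
    char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
    if (first < 'i') return 0 % numReduceTasks;       // keys a..h
    else if (first < 'q') return 1 % numReduceTasks;  // keys i..p
    else return 2 % numReduceTasks;                   // remaining keys
  }
}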

29 Apache Hadoop An open source implementation of the MapReduce framework, originally developed by Doug Cutting and heavily used by e.g., Yahoo! and Facebook Moving Computation is Cheaper than Moving Data - ship code to data, not data to code. Map and Reduce workers are also storage nodes for the underlying distributed filesystem: Tasks are first allocated to a node holding a copy of the data, and failing that, to a node in the same rack (to maximize network bandwidth) Project Web page: 29/77

30 Apache Hadoop (cnt.) When deciding whether MapReduce is the correct fit for an algorithm, one has to remember the fixed data-flow pattern of MapReduce. The algorithm has to be efficiently mapped to this data-flow pattern in order to efficiently use the underlying computing hardware Builds reliable systems out of unreliable commodity hardware by replicating most components (exceptions: Master/Job Tracker and NameNode in Hadoop Distributed File System) 30/77

31 Apache Hadoop (cnt.) Tuned for large files (gigabytes of data) Designed for very large (1 PB+) data sets Designed for streaming data accesses in batch processing, optimized for high bandwidth instead of low latency For scalability: NOT a POSIX filesystem Written in Java, runs as a set of user-space daemons 31/77

32 Hadoop Distributed Filesystem (HDFS) A distributed replicated filesystem: All data is replicated by default on three different Data Nodes Inspired by the Google Filesystem Each node is usually a Linux compute node with a small number of hard disks (4-12) A single NameNode that maintains the file locations, many DataNodes (1000+) 32/77

33 Hadoop Distributed Filesystem (cnt.) Any piece of data is available if at least one datanode replica is up and running Rack optimized: by default one replica is written locally, a second in the same rack, and a third replica in another rack (to guard against rack failures, e.g., rack switch or rack power feed) Uses a large block size, 128 MB is a common default - designed for batch processing For scalability: a write once, read many filesystem 33/77
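A minimal sketch of writing a file to HDFS through the Java API (the path is made-up; because of the write-once model, the file is written sequentially and becomes immutable once closed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);      // HDFS if so configured
    FSDataOutputStream out = fs.create(new Path("/user/example/hello.txt"));
    out.writeBytes("Hello HDFS\n");            // sequential writes only
    out.close();                               // the file is now read-only
  }
}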

34 Implications of Write Once All applications need to be re-engineered to only do sequential writes. Example systems working on top of HDFS: HBase (Hadoop Database), a database system with only sequential writes, Google Bigtable clone MapReduce batch processing system Apache Pig and Hive data mining tools Mahout machine learning libraries Lucene and Solr full text search Nutch web crawling 34/77

35 Two Large Hadoop Installations Yahoo! (2009): 4000 nodes, 16 PB raw disk, 64 TB RAM, 32K cores Facebook (2010): 2000 nodes, 21 PB storage, 64 TB RAM, 22.4K cores 12 TB (compressed) data added per day, 800 TB (compressed) data scanned per day A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, H. Liu: Data warehousing and analytics infrastructure at Facebook. SIGMOD Conference 2010. 35/77

36 Writing Hadoop Jobs See Hadoop installation instructions at: /practical_matters Example: Assume we want to count the words in a file using Apache Hadoop What we need to write in Java: Code to include the needed Hadoop Java Libraries Code for the Mapper Code for the Reducer (Optionally: Code for the Combiner) Code for the main method for Job configuration and submission 36/77

37 WordCount: Including Hadoop Java Libraries

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

37/77

38 WordCount: Including Hadoop Java Libraries The package org.apache.hadoop.examples line gives the package name we are defining The java.util.StringTokenizer is a utility to parse a string into a list of words by returning an iterator over the words of the string Most org.apache.hadoop.* libraries listed are needed by all Hadoop jobs, cut&paste from examples 38/77

39 WordCount: Including Hadoop Java Libraries The IntWritable is a Hadoop variant of the Java Integer type. The Text is a Hadoop variant of the Java String class In addition: the LongWritable is a Hadoop variant of the Java Long type, the FloatWritable is a Hadoop variant of the Java Float type, etc. The Hadoop types must be used for both Mapper and Reducer keys and values instead of the Java native types (see the sketch below) 39/77
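A small sketch of how the Hadoop wrapper types are used: values are read and written through get()/set() (or toString()) rather than being Java primitives directly.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableSketch {
  public static void main(String[] args) {
    IntWritable count = new IntWritable(1);  // wraps a Java int
    count.set(count.get() + 1);              // read and update the value
    Text word = new Text();
    word.set("Hadoop");                      // wraps a (UTF-8) string
    System.out.println(word + " " + count);  // prints: Hadoop 2
  }
}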

40 WordCount: Code for the Mapper

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

40/77

41 WordCount: Code for the Mapper The line public class WordCount tells that we are defining the WordCount class, to which the Mapper, Reducer, and main method all belong The line public static class TokenizerMapper tells the name of the Mapper we are defining The four parameters of the Mapper in: extends Mapper<Object, Text, Text, IntWritable> are the types of the input key, input value, output key, and output value, respectively The key argument is not used by the WordCount Mapper, but by default it is a long integer offset for text inputs 41/77

42 WordCount: Code for the Mapper Remember that Hadoop keys and values are of Hadoop internal types: Text, IntWritable, LongWritable, FloatWritable, etc. Code is needed to convert between them and the Java native types. Because of this, we define the Hadoop types used for creating the Mapper output: The code: private final static IntWritable one = new IntWritable(1) creates an IntWritable constant 1 The code: private Text word = new Text() creates a Text object to be used as the output key object of the Mapper 42/77

43 WordCount: Code for the Mapper The Mapper method is defined as: public void map(Object key, Text value, Context context), where Context is an object used to write the output of the Mapper The Java code: StringTokenizer itr = new StringTokenizer(value.toString()) parses a Text value, converted into a Java String, into a list of words by returning an iterator itr over the words of the String The code word.set(itr.nextToken()) overwrites the current contents of the Text object word with the next String contents returned by the iterator itr The code context.write(word, one) outputs the parsed word as the key and the constant 1 as the value from the Mapper, giving an output pair (word, 1) of (Text, IntWritable) type 43/77

44 WordCount: Code for the Reducer

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

44/77

45 WordCount: Code for the Reducer The line public static class IntSumReducer gives the name of the Reducer The four parameters of the Reducer in: extends Reducer<Text,IntWritable,Text,IntWritable> are the types of the input key, input value, output key, and output value, respectively. The input types of the Reducer must match the output types of the Mapper The code private IntWritable result = new IntWritable() generates an IntWritable object to be used as the output value of the Reducer 45/77

46 WordCount: Code for the Reducer The Reducer method is defined as: public void reduce(Text key, Iterable<IntWritable> values, Context context), where Iterable<IntWritable> values is an iterator over the value type of the Mapper output, and Context is an object used to write the output of the Reducer The Reducer will be called once per key, and will be given the list of values mapped to that key to iterate over using the values iterator 46/77

47 WordCount: Code for the Reducer The loop for (IntWritable val : values) iterates over the values mapped to the current key and sums them up The code result.set(sum) sets the IntWritable object result using the sum variable; remember that native Java classes cannot be directly used as outputs The code context.write(key, result) writes the Reducer output as a (word, count) pair of (Text, IntWritable) type 47/77

48 WordCount: Code for main 1/2

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs =
        new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }

48/77

49 WordCount: Code for main 1/2 The line public static void main(String[] args) declares the main routine that will be running on the client machine that starts the MapReduce job The line Configuration conf = new Configuration() creates the configuration object conf holding all the information of the MapReduce job to be created The line String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs() and the code after it do argument processing and check that the code is called with two arguments 49/77

50 WordCount: Code for main 2/2

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

50/77

51 WordCount: Code for main 2/2 The code Job job = new Job(conf, "word count") creates a new MapReduce Job with the name word count using the created configuration object conf The line job.setJarByClass(WordCount.class) tells Hadoop which class (and thus which jar file) contains the job code The Mapper is set: job.setMapperClass(TokenizerMapper.class) The (optional!) Combiner is set: job.setCombinerClass(IntSumReducer.class); note that this is sound only when the Reduce function is commutative and associative! (Addition is!) The Reducer is set: job.setReducerClass(IntSumReducer.class) 51/77
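To see why commutativity and associativity matter, compare summing with averaging over the values 1, 2, 3 when a combiner pre-aggregates the first two values on one mapper: sum(sum(1, 2), 3) = 6 = sum(1, 2, 3), so a summing reducer can safely be reused as a combiner, but avg(avg(1, 2), 3) = avg(1.5, 3) = 2.25 ≠ avg(1, 2, 3) = 2, so an averaging reducer cannot. The usual workaround is to have the mapper and combiner emit (sum, count) pairs and divide only in the reducer.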

52 WordCount: Code for main 2/2 The Java type system is not able to figure out the correct types of output keys and values, so we have to specify them again: The code job.setOutputKeyClass(Text.class) sets the output key of both Mapper and Reducer to the Text type. If the Mapper has a different output key type from the Reducer, the Mapper key type can be specified by overriding this using job.setMapOutputKeyClass() The code job.setOutputValueClass(IntWritable.class) sets the output value of both Mapper and Reducer to the IntWritable type. Also job.setMapOutputValueClass() can be used to override this if the Mapper output value type is different Also because of these Java limitations, the compiler does not complain at compile time if Mapper outputs and Reducer inputs are of incompatible types 52/77
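For example, if a (hypothetical) job had a Mapper emitting (Text, LongWritable) pairs and a Reducer emitting (Text, DoubleWritable) pairs, the job configuration would have to spell out both sets of types (LongWritable and DoubleWritable are further Hadoop wrapper types from org.apache.hadoop.io):

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);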

53 WordCount: Code for main 2/2 The code FileInputFormat.addInputPath(job, new Path(otherArgs[0])) sets the input to be all of the files in the input directory given as the first command line argument The code FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])) directs the output to the output directory given as the second command line argument, one file per reducer The code System.exit(job.waitForCompletion(true) ? 0 : 1) submits the job to Hadoop, waits for the job to finish, and returns 0 on success, 1 on failure 53/77

54 A Quick WordCount Demo: Formatting HDFS $ hadoop namenode -format 11/09/20 01:39:29 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = nipsu.local/ STARTUP_MSG: args = [-format] STARTUP_MSG: version = STARTUP_MSG: build = branch-0.20-security-203 -r ; compiled by oom on Wed May 4 07:57:50 PDT 2011 ************************************************************/ 11/09/20 01:39:29 INFO util.GSet: VM type = 64-bit 11/09/20 01:39:29 INFO util.GSet: 2% max memory = MB 11/09/20 01:39:29 INFO util.GSet: capacity = 2ˆ21 = entries 11/09/20 01:39:29 INFO util.GSet: recommended= , actual= /09/20 01:39:30 INFO namenode.FSNamesystem: fsOwner=keijoheljanko 11/09/20 01:39:30 INFO namenode.FSNamesystem: supergroup=supergroup 11/09/20 01:39:30 INFO namenode.FSNamesystem: isPermissionEnabled=true 11/09/20 01:39:30 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100 11/09/20 01:39:30 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 11/09/20 01:39:30 INFO namenode.NameNode: Caching file names occuring more than 10 times 11/09/20 01:39:30 INFO common.Storage: Image file of size 119 saved in 0 seconds. 11/09/20 01:39:30 INFO common.Storage: Storage directory /tmp/hadoop-keijoheljanko/dfs/name has been successfully formatted. 11/09/20 01:39:30 INFO namenode.NameNode: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at nipsu.local/ ************************************************************/ 54/77

55 A Quick WordCount Demo: Starting Hadoop $ start-all.sh starting namenode, logging to /Users/keijoheljanko/hadoop/hadoop /bin/.. /logs/hadoop-keijoheljanko-namenode-nipsu.local.out localhost: starting datanode, logging to /Users/keijoheljanko/hadoop/hadoop /bin/.. /logs/hadoop-keijoheljanko-datanode-nipsu.local.out localhost: starting secondarynamenode, logging to /Users/keijoheljanko/hadoop/hadoop /bin/.. /logs/hadoop-keijoheljanko-secondarynamenode-nipsu.local.out starting jobtracker, logging to /Users/keijoheljanko/hadoop/hadoop /bin/.. /logs/hadoop-keijoheljanko-jobtracker-nipsu.local.out localhost: starting tasktracker, logging to /Users/keijoheljanko/hadoop/hadoop /bin/.. /logs/hadoop-keijoheljanko-tasktracker-nipsu.local.out 55/77

56 A Quick WordCount Demo: Uploading Kalevala to HDFS $ wget :41: Resolving ( , 2001:708:10:9::20:3 Connecting to ( :80... connected. HTTP request sent, awaiting response OK Length: (196K) [application/x-gzip] Saving to: kalevala.txt.gz 100%[===============================================================================>] 200, :41:24 (3.06 MB/s) - kalevala.txt.gz saved [200872/200872] $ gunzip kalevala.txt.gz $ recode ISO UTF-8 kalevala.txt $ ls -al kalevala.txt -rw-r--r-- 1 keijoheljanko staff Nov kalevala.txt $ hadoop fs -mkdir input $ hadoop fs -copyFromLocal kalevala.txt input $ hadoop fs -ls input Found 1 items -rw-r--r-- 1 keijoheljanko supergroup :46 /user/keijoheljanko/input/ 56/77

57 A Quick WordCount Demo: Compiling WordCount $ ls -al WordCount.java -rw-r--r--@ 1 keijoheljanko staff 2395 Sep 18 13:53 WordCount.java $ javac -classpath ${HADOOP_HOME}/hadoop-core-${HADOOP_VERSION}.jar:${HADOOP_HOME}/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java $ jar -cvf wordcount.jar -C wordcount_classes/. added manifest adding: org/(in = 0) (out= 0)(stored 0%) adding: org/apache/(in = 0) (out= 0)(stored 0%) adding: org/apache/hadoop/(in = 0) (out= 0)(stored 0%) adding: org/apache/hadoop/examples/(in = 0) (out= 0)(stored 0%) adding: org/apache/hadoop/examples/WordCount$IntSumReducer.class(in = 1789) (out= 746)(deflated 58%) adding: org/apache/hadoop/examples/WordCount$TokenizerMapper.class(in = 1790) (out= 765)(deflated 57%) adding: org/apache/hadoop/examples/WordCount.class(in = 1911) (out= 996)(deflated 47%) $ ls -al wordcount.jar -rw-r--r-- 1 keijoheljanko staff 3853 Sep 20 02:07 wordcount.jar 57/77

58 A Quick WordCount Demo: Running WordCount (1/2) $ hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount input output 11/09/20 02:11:32 INFO input.FileInputFormat: Total input paths to process : 1 11/09/20 02:11:32 INFO mapred.JobClient: Running job: job_ _ /09/20 02:11:33 INFO mapred.JobClient: map 0% reduce 0% 11/09/20 02:11:47 INFO mapred.JobClient: map 100% reduce 0% 11/09/20 02:12:00 INFO mapred.JobClient: map 100% reduce 100% 11/09/20 02:12:05 INFO mapred.JobClient: Job complete: job_ _ /09/20 02:12:05 INFO mapred.JobClient: Counters: 25 11/09/20 02:12:05 INFO mapred.JobClient: Job Counters 11/09/20 02:12:05 INFO mapred.JobClient: Launched reduce tasks=1 11/09/20 02:12:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS= /09/20 02:12:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 11/09/20 02:12:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 11/09/20 02:12:05 INFO mapred.JobClient: Launched map tasks=1 11/09/20 02:12:05 INFO mapred.JobClient: Data-local map tasks=1 11/09/20 02:12:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES= /09/20 02:12:05 INFO mapred.JobClient: File Output Format Counters 11/09/20 02:12:05 INFO mapred.JobClient: Bytes Written= /77

59 A Quick WordCount Demo: Running WordCount (2/2) 11/09/20 02:12:05 INFO mapred.JobClient: FileSystemCounters 11/09/20 02:12:05 INFO mapred.JobClient: FILE_BYTES_READ= /09/20 02:12:05 INFO mapred.JobClient: HDFS_BYTES_READ= /09/20 02:12:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN= /09/20 02:12:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN= /09/20 02:12:05 INFO mapred.JobClient: File Input Format Counters 11/09/20 02:12:05 INFO mapred.JobClient: Bytes Read= /09/20 02:12:05 INFO mapred.JobClient: Map-Reduce Framework 11/09/20 02:12:05 INFO mapred.JobClient: Reduce input groups= /09/20 02:12:05 INFO mapred.JobClient: Map output materialized bytes= /09/20 02:12:05 INFO mapred.JobClient: Combine output records= /09/20 02:12:05 INFO mapred.JobClient: Map input records= /09/20 02:12:05 INFO mapred.JobClient: Reduce shuffle bytes= /09/20 02:12:05 INFO mapred.JobClient: Reduce output records= /09/20 02:12:05 INFO mapred.JobClient: Spilled Records= /09/20 02:12:05 INFO mapred.JobClient: Map output bytes= /09/20 02:12:05 INFO mapred.JobClient: Combine input records= /09/20 02:12:05 INFO mapred.JobClient: Map output records= /09/20 02:12:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=124 11/09/20 02:12:05 INFO mapred.JobClient: Reduce input records= /77

60 A Quick WordCount Demo: Downloading the Output $ hadoop fs -copyToLocal output output $ ls -al output total 624 drwxr-xr-x 5 keijoheljanko staff 170 Sep 20 02:16. drwxr-xr-x 17 keijoheljanko staff 578 Sep 20 02:16.. -rw-r--r-- 1 keijoheljanko staff 0 Sep 20 02:16 _SUCCESS drwxr-xr-x 3 keijoheljanko staff 102 Sep 20 02:16 _logs -rw-r--r-- 1 keijoheljanko staff Sep 20 02:16 part-r $ head -15 output/part-r "Ahto,1 "Aik 1 "Aika 1 "Ain 1 "Aina 2 "Ainap 1 "Ainapa 1 "Aita 2 "Ajoa 1 "Akat 1 "Akatp 1 "Akka 5 "Ala 1 "Ampuisitko 1 "Anna 7 60/77

61 Hadoop Books I can warmly recommend the book: Tom White: Hadoop: The Definitive Guide, Second Edition, O'Reilly Media, Print ISBN: , Ebook ISBN: An alternative book is: Chuck Lam: Hadoop in Action, Manning, Print ISBN: 61/77

62 Recap of Hadoop Commands Used start-all.sh Starts all Hadoop daemons. Hint: If you are getting the following error on Mac OSX: Bad connection to FS. command aborted. exception: Call to localhost/ :9000 failed on connection exception: java.net.connectexception: Connection refused call start-all.sh another time to start the namenode for real. 62/77

63 Recap of Hadoop Commands Used (cnt.) hadoop namenode -format Formats the HDFS filesystem hadoop fs -mkdir foo Creates a directory called foo in the HDFS home directory of the user hadoop fs -rmr foo Removes the directory/file called foo in the HDFS home directory of the user hadoop fs -copyFromLocal foo.txt bar Copies the file foo.txt to the HDFS directory/file called bar in the HDFS home directory of the user 63/77

64 Recap of Hadoop Commands Used (cnt.) hadoop fs -ls foo Does a directory/file listing for the directory called foo in the HDFS home directory of the user hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount foo bar Runs the MapReduce job in the file wordcount.jar in the class org.apache.hadoop.examples.WordCount using the HDFS directory foo as input and stores the output in a new HDFS directory bar to be created by the job (i.e., bar should not yet exist in HDFS) 64/77

65 Recap of Hadoop Commands Used (cnt.) hadoop fs -copyToLocal foo bar Copies the directory/file foo in the user's HDFS home directory to the local file system directory/file bar stop-all.sh Stops all Hadoop daemons 65/77

66 Hadoop Default Ports Use the URL http://localhost:50030/ (the default JobTracker Web UI port) to access the JobTracker through a Web browser. Hint: Use the refresh button to update the view during Job execution. Use the URL http://localhost:50070/ (the default NameNode Web UI port) to access the NameNode through a Web browser. Hint: You can also browse the HDFS filesystem through this URL. 66/77

67 Hadoop NameNode and SecondaryNameNode Locking The Hadoop NameNode and SecondaryNameNode use locking to disallow the accidental starting of two daemons modifying the same files and leading to HDFS inconsistency If either daemon gets killed uncleanly, they might leave lock files around and prevent a new NameNode or SecondaryNameNode from starting The lock files are called data/in_use.lock, name/in_use.lock, and namesecondary/in_use.lock. They are located in the dfs subdirectory of hadoop.tmp.dir, which can be set in core-site.xml (by default /tmp/hadoop-${user.name}) 67/77

68 Commercial Hadoop Support Cloudera: A Hadoop distribution based on Apache Hadoop + patches. Available from: Hortonworks: Yahoo! spin-off of their Hadoop development team, will be working on mainline Apache Hadoop. MapR: A rewrite of much of Apache Hadoop in C++, including a new filesystem. API-compatible with Apache Hadoop. 68/77

69 Cloud Software Project ICT SHOK Program Cloud Software ( ) A large Finnish consortium, see: Case study at Aalto: CSC Genome Browser Cloud Backend Co-authors: Matti Niemenmaa, André Schumacher (Aalto University, Department of Information and Computer Science), Aleksi Kallio, Eija Korpelainen, Taavi Hupponen, and Petri Klemelä (CSC IT Center for Science) 69/77

70 CSC Genome Browser CSC provides tools and infrastructure for bioinformatics Bioinformatics is the largest customer group of CSC (in user numbers) Next-Generation Sequencing (NGS) produces large data sets (TB+) Cloud computing can be harnessed for analyzing these data sets 1000 Genomes project: a freely available 50 TB+ data set of human genomes 70/77

71 CSC Genome Browser Cloud computing technologies will enable scalable NGS data analysis There exists prior research into sequence alignment and assembly in the cloud Visualization is needed in order to understand large data masses Interactive visualization for 100GB+ datasets can only be achieved with preprocessing in the cloud 71/77

72 Genome Browser Requirements Interactive browsing with zooming in and out, Google Earth style Single datasets of 100 GB-1 TB+ with interactive visualization at different zoom levels Preprocessing is used to compute summary data for the higher zoom levels The dataset is too large to compute the summary data in real time from the raw data The scalable cloud data processing paradigm MapReduce, as implemented in Hadoop, is used to compute the summary data in preprocessing (currently up to 15 x 12 = 180 cores) 72/77

73 Genome Browser GUI 73/77

74 Experiments One 50 GB (compressed) BAM input file from 1000 Genomes Run on the Triton cluster, 1-15 compute nodes with 12 cores each Four repetitions for increasing numbers of worker nodes Two types of runs: sorting according to starting position ("sorted") and read aggregation using five summary-block sizes at once ("summarized") 74/77

75 Mean Speedup Figure: Mean speedup as a function of the number of workers for the 50 GB sorted run and the 50 GB summarized run (block sizes B=2,4,8,16,...), with curves for ideal speedup, input file import, sorting/summarizing, output file export, and total elapsed time 75/77

76 Results An updated CSC Genome browser GUI Chipster: Tools and framework for producing visualizable data An implementation of preprocessing for visualization using Hadoop Scalability studies for running Hadoop on 50 GB+ datasets on the Triton cloud testbed Software released, with more than 1200 downloads: 76/77

77 Current Research Topics High-level programming frameworks for ad-hoc queries: Apache Pig integration of Hadoop-BAM in the SeqPig tool, and SQL integration in Apache Hive Real-time SQL query evaluation on Big Data: Cloudera Impala and Berkeley Shark Scalable datastores: HBase (Hadoop Database) for real-time read/write database workloads Energy efficiency through better main memory and SSD utilization 77/77


More information

Hadoop Configuration and First Examples

Hadoop Configuration and First Examples Hadoop Configuration and First Examples Big Data 2015 Hadoop Configuration In the bash_profile export all needed environment variables Hadoop Configuration Allow remote login Hadoop Configuration Download

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm

More information

BIG DATA, MAPREDUCE & HADOOP

BIG DATA, MAPREDUCE & HADOOP BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS

More information

Word Count Code using MR2 Classes and API

Word Count Code using MR2 Classes and API EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Distributed Filesystems

Distributed Filesystems Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Hands-on Exercises with Big Data

Hands-on Exercises with Big Data Hands-on Exercises with Big Data Lab Sheet 1: Getting Started with MapReduce and Hadoop The aim of this exercise is to learn how to begin creating MapReduce programs using the Hadoop Java framework. In

More information

From Wikipedia, the free encyclopedia

From Wikipedia, the free encyclopedia Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with

More information

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology

Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology Hadoop Dawid Weiss Institute of Computing Science Poznań University of Technology 2008 Hadoop Programming Summary About Config 1 Open Source Map-Reduce: Hadoop About Cluster Configuration 2 Programming

More information

Hadoop WordCount Explained! IT332 Distributed Systems

Hadoop WordCount Explained! IT332 Distributed Systems Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,

More information

Introduction to Big Data Science. Wuhui Chen

Introduction to Big Data Science. Wuhui Chen Introduction to Big Data Science Wuhui Chen What is Big data? Volume Variety Velocity Outline What are people doing with Big data? Classic examples Two basic technologies for Big data management: Data

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

How To Use Hadoop

How To Use Hadoop Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

Hadoop: Understanding the Big Data Processing Method

Hadoop: Understanding the Big Data Processing Method Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

More information