Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Big Data Science: Fundamentals, Techniques, and Challenges (HDFS & Map-Reduce)


1 Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Big Data Science: Fundamentals, Techniques, and Challenges (HDFS & Map-Reduce) Incheon Paik University of Aizu, Japan Big Data Science 1

2 Contents What is Big Data? Big Data Source Clustered File System & Distributed File System Hadoop Distributed File System (HDFS) Map-Reduce Operation Application of Map-Reduce Big Data Science 2

3 Overview of Big Data Science What is Big Data? Very large and complex data sets are now produced by nature, sensors, social networks, and enterprises, increasingly coupled with high-speed computers and networks. Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. Big Data Science 3

4 Overview of Big Data Science Several Areas Meteorology Genomics Connectomics Complex physics simulations Biological and environmental research Internet search Finance and business informatics Sources of data sets Ubiquitous information-sensing mobile devices Aerial sensory technologies (remote sensing) Software logs Cameras Microphones Radio-frequency identification readers Wireless sensor networks Big Data Science 4

5 Clustered File System A file system that is shared by being simultaneously mounted on multiple servers. Clustered file systems can provide features like Location-independent addressing Redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spreads data across multiple storage nodes, usually for redundancy or performance. Big Data Science 5

6 Clustered File System (DAS/NAS/SAN) Source: Big Data Science 6

7 Fundamental Differences in DAS/NAS/SAN Source: he.wikipedia.org Big Data Science 7

8 Storage Area Network Source: pccgroup.com Big Data Science 8

9 Shared-Disk / Storage Area Network A shared-disk file system uses a storage area network (SAN) to provide direct disk access from multiple computers at the block level. Access control and translation from file-level operations (application) to block-level operations (by the SAN) must take place on the client node. Adds a mechanism for concurrency control, which gives a consistent and serializable view of the file system, avoiding corruption and unintended data loss. Usually employs some sort of fencing mechanism to prevent data corruption in case of node failures. Big Data Science 9

10 Shared-Disk / Storage Area Network The SAN may use any of a number of block-level protocols, including SCSI, iSCSI, HyperSCSI, ATA over Ethernet, Fibre Channel, etc. There are different architectural approaches to a shared-disk file system: Distribute file information across all the servers in a cluster (fully distributed). Utilize a centralized metadata server. Both achieve the same result of enabling all servers to access all the data on a shared storage device. Big Data Science 10

11 Network Attached Storage (NAS) Source: Big Data Science 11

12 Network Attached Storage (NAS) Provides both storage and a file system, like a shared-disk file system on top of a SAN. Typically uses file-based protocols (as opposed to the block-based protocols a SAN would use) such as NFS, SMB/CIFS, AFP, or NCP. Design Considerations Avoiding a single point of failure: fault tolerance and high availability through data replication of one sort or another. Performance: fast disk-access time and a small amount of CPU-processing time over the distributed structure. Concurrency: consistent and efficient multiple accesses to the same file or block, via concurrency control or locking, which may either be built into the file system or provided by an add-on protocol. Big Data Science 12

13 NAS vs. SAN Source: turbotekcomputer.com Big Data Science 13

14 Distributed File System Distributed file systems do not share block level access to the same storage but use a network protocol. These are commonly known as network file systems, even though they are not the only file systems that use the network to send data. Distributed file systems can restrict access to the file system depending on access lists or capabilities on both the servers and the clients, depending on how the protocol is designed. Big Data Science 14

15 Distributed File System The difference between a distributed file system and a distributed data store Distributed file system (DFS) allows files to be accessed using the same interfaces and semantics as local files - e.g. mounting/unmounting, listing directories, read/write at byte boundaries, system's native permission model. Distributed data stores, by contrast, require using a different API or library and have different semantics (most often those of a database). Big Data Science 15

16 Distributed File System Design Goal of DFS Access transparency: Clients are unaware that files are distributed and can access them in the same way as local files are accessed. Location transparency: A consistent name space exists encompassing local as well as remote files. The name of a file does not give its location. Concurrency transparency: All clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner. Big Data Science 16

17 Distributed File System Design Goal of DFS Failure transparency: The client and client programs should operate correctly after a server failure. Heterogeneity: File service should be provided across different hardware and operating system platforms. Scalability: The file system should work well in small environments (1 machine, a dozen machines) and also scale gracefully to huge ones (hundreds through tens of thousands of systems). Replication transparency: To support scalability, we may wish to replicate files across multiple servers. Clients should be unaware of this. Migration transparency: Files should be able to move around without the client's knowledge. Big Data Science 17

18 Examples Distributed File System GFS (Google Inc.) Ceph (Inktank) Windows Distributed File System (DFS) (Microsoft) FhGFS (Fraunhofer) GlusterFS (Red Hat) Lustre Ibrix Big Data Science 18

19 Hadoop Distributed File System (HDFS) Existing large-file storage systems such as NAS or SAN are expensive and need very high-performance servers. HDFS can combine web-server-class hosts into one large disk store. Some high-performance large storage systems will still be needed: SAN is suitable for a DBMS, NAS for safe file storage. HDFS can process big data concurrently using Map-Reduce. Four Fundamental Objectives of HDFS Error recovery Streaming data access Large data storage Data integrity Big Data Science 19

20 Hadoop Distributed File System Big Data Science 20

21 Block Structured File System File Copy Structure on HDFS: a 320 MB file is split into five blocks (Block 1 - Block 5, 64 MB each at the default block size), and each block is replicated three times across five DataNodes: Node 1: Blocks 1, 3, 4 Node 2: Blocks 2, 3, 4 Node 3: Blocks 1, 3, 5 Node 4: Blocks 2, 4, 5 Node 5: Blocks 1, 2, 5 Big Data Science 21
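As a minimal sketch (an addition, not from the slides) of how this block size and replication factor can be requested through the standard org.apache.hadoop.fs API; the property names are the Hadoop 2 spellings and the file paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // 64 MB blocks and 3 replicas, matching the layout above (illustrative values)
            conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
            conf.setInt("dfs.replication", 3);
            FileSystem fs = FileSystem.get(conf);
            // Copy a local file into HDFS; the client writes it as a sequence of replicated blocks
            fs.copyFromLocalFile(new Path("/tmp/input.dat"), new Path("/user/demo/input.dat"));
            fs.close();
        }
    }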

22 Block Structured File System Source:bradhedlund.com Big Data Science 22

23 Block Structured File System Source:bradhedlund.com Big Data Science 23

24 HDFS, NoSQL, Map-Reduce NoSQL A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. NoSQL databases are often highly optimized key-value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput. Big Data Science 24

25 Map Operation What is a Map operation? Doing something to every element in an array is a common operation, and that operation can be considered a Map function. int[] a = {1,2,3}; for (int i = 0; i < a.length; i++) a[i] = a[i] * 2; This can be written as a function. The new value of the variable a would be int[] a = {2,4,6}; Map-Reduce Operation Source: bigdatauniversity.com Big Data Science 25

26 Map Operation What is a Map operation? Doing something to every element in an array is a common operation, and that operation can be considered a Map function. int[] a = {1,2,3}; for (int i = 0; i < a.length; i++) a[i] = fx(a[i]); All of this can also be converted into a map function. The new value of the variable a would be int[] a = {2,4,6}; like this, where fx is a function defined as: function fx(x) { return x * 2; } Big Data Science 26

27 Map Operation What is a Map operation? Like this, where fx is a function passed as an argument: function map(fx, a) { for (i = 0; i < a.length; i++) a[i] = fx(a[i]); } You can invoke this map function as follows: map(function(x) { return x * 2; }, a); Here the function fx's definition is included in the call. Big Data Science 27

28 Reduce Operation What is a Reduce operation? Another common operation on arrays is to combine all their values: function sum(a) { int s = 0; for (i = 0; i < a.length; i++) s += a[i]; return s; } This can be written as a function. Big Data Science 28

29 Reduce Operation What is a Reduce operation? Another common operation on arrays is to combine all their values: function sum(a) { int s = 0; for (i = 0; i < a.length; i++) s = fx(s, a[i]); return s; } Like this, where a function fx is defined so that it adds its arguments: function fx(a,b) { return a + b; } The whole function sum can also be rewritten so that fx is passed as an argument, as a reduce operation: reduce(function(a,b) { return a + b; }, a, 0); Big Data Science 29
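The same map and reduce pattern is built into Java's streams API; as a small, self-contained sketch (an illustrative addition, not from the slides), doubling every element and then folding the result into a sum:

    import java.util.Arrays;

    public class MapReduceLocal {
        public static void main(String[] args) {
            int[] a = {1, 2, 3};
            // Map: apply f(x) = x * 2 to every element
            int[] doubled = Arrays.stream(a).map(x -> x * 2).toArray();
            // Reduce: fold the array into one value with f(s, x) = s + x, starting from 0
            int sum = Arrays.stream(doubled).reduce(0, (s, x) -> s + x);
            System.out.println(Arrays.toString(doubled) + " -> " + sum); // prints [2, 4, 6] -> 12
        }
    }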

30 Submitting a MapReduce Job Source:bigdatauniversity.com Big Data Science 30

31 Shuffle Process 1. Split creation 2. Map 3. Spill 4. Merge 5. Copy 6. Sort 7. Reduce (Diagram: input splits feed Map1-Map3; each map's output goes through a memory buffer, is partitioned and spilled to files, merged, then copied to Reduce1-Reduce2, where it is sorted and reduced.) Source: Beginning Hadoop Programming Big Data Science 31

32 MapReduce Type and Data Flow Lists Source:bigdatauniversity.com Big Data Science 32

33 HADOOP MapReduce Program : WordCount.java

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper Implementation
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);   // emit <word, 1>
            }
        }
    }

    // Reducer Implementation
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values)
                sum += value.get();
            result.set(sum);
            context.write(key, result);    // emit <word, total count>
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Parameter Passing
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();
        // Creation of Job
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        // Assign Classes for Mapper/Reducer
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Other Settings for Job
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Big Data Science 33

34 Map-Reduce Architecture on Hadoop (Diagram: a Client talks to the Job Tracker; each Data Node runs a Task Tracker that executes Map and Reduce tasks.) Source: Beginning Hadoop Programming Big Data Science 34

35 Map-Reduce Architecture on Hadoop Client: the Map-Reduce API provided by Hadoop and the Map-Reduce program executed by a user. Job Tracker A Map-Reduce program is managed as a work unit, a Job. Manages scheduling and monitoring of all jobs on the Hadoop cluster. Usually runs on the Name Node server, but not necessarily. Description of Job Tracker When a user requests a new job, it calculates how many Maps and Reduces will be executed for the job. Big Data Science 35

36 Map-Reduce Architecture on Hadoop Description of Job Tracker A Map-Reduce program executed by a client through Hadoop is managed by the unit of a Job. The Job Tracker handles scheduling of all jobs registered on the Hadoop cluster, but it does not have to run on the Name Node server. It decides on which Task Trackers the Maps and Reduces will be executed and assigns the job to those Task Trackers. -> The Task Trackers then execute the Map-Reduce programs. The Job Tracker communicates with the Task Trackers through heartbeats to check the status of the Task Trackers and job execution information. Big Data Science 36

37 Map-Reduce Architecture on Hadoop Task Tracker A daemon program that executes a user's Map-Reduce program on a Data Node. Creates the Map tasks and Reduce tasks that the Job Tracker requested. Executes the tasks (Maps and Reduces) by running a new JVM. Workflow of Map-Reduce A user executes a job by invoking the waitForCompletion method. A Job Client object is created in the Job interface, and the object accesses the Job Tracker to execute the job. Big Data Science 37

38 Map-Reduce Operation on Hadoop (Diagram: Name Node, Data Nodes, and Client) 1. Map-Reduce job execution (Client -> Job Client) 2. New job submission (Job Client -> Job Tracker) 3. Input split creation from the input data 4. Job assignment (Job Tracker -> Task Tracker) 5. Mapper execution (Mapper tasks; a Partitioner assigns key-value output to partitions) 6. Sort and merge 7. Reducer execution (key-value output) 8. Output data save Source: Beginning Hadoop Programming Big Data Science 38

39 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce Then, the Job Tracker returns a new job ID to the Job Client, and the Job Client checks the output path defined by the user. The Job Client calculates the input splits (the portions of the input file to be processed by each Map) for the input data of the job. After the calculation of the input splits, it saves the input split information, the configuration file, and the Map-Reduce JAR file to HDFS, and notifies the Job Tracker that preparation for starting the job is complete. The Job Tracker registers the job on the queue, and the job scheduler initializes the job from the queue. The job scheduler then creates Map tasks according to the input split information and assigns IDs. Big Data Science 39

40 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce The Task Tracker reports its status to the Job Tracker periodically by calling the heartbeat method. The heartbeat information contains resource information about CPU, memory, disk, the number of running tasks, the maximum number of available task slots, and whether new tasks can be executed. The Task Tracker executes the Map task assigned by the Job Tracker. The Map task executes the logic in the map method and saves its output data to a memory buffer. At this point, a partitioner decides to which Reduce task the output data should be transferred and assigns a suitable partition for it. The data in memory are sorted by key and spilled to local hard disk. When the spill is complete, the files are sorted and merged into one file. Big Data Science 40

41 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce A Reduce task is the Reducer class written by a user. The task can be executed only when all map output data are ready. When the Map task completes its output, it notifies the Task Tracker that ran it. When the Task Tracker receives the message, it reports the status of the corresponding Map task and the output data path of the Map task to the Job Tracker. The Reduce task copies the output data of all Map tasks and merges them. When the merge is complete, it executes the analysis logic by calling the reduce method. The Reduce task saves the output data to HDFS under the name part-nnnnn, where nnnnn is the partition ID. Big Data Science 41

42 Component of Map-Reduce Programming Data Types Map-Reduce objects are optimized for network communication, and Hadoop provides the WritableComparable interface. All data types for keys and values in a Map-Reduce program should implement the WritableComparable interface. The WritableComparable interface inherits the Writable and Comparable interfaces. The write method serializes a data value, and the readFields method deserializes it to read the serialized data. public interface Writable { void write(DataOutput out) throws IOException; void readFields(DataInput in) throws IOException; } Big Data Science 42
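To make this concrete, here is a minimal custom key type implementing WritableComparable; it is an illustrative addition, and the class name and fields (a hypothetical composite key of user ID and timestamp) are not from the slides:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: (userId, timestamp)
    public class EventKey implements WritableComparable<EventKey> {
        private long userId;
        private long timestamp;

        public EventKey() {}   // required no-arg constructor for deserialization

        public EventKey(long userId, long timestamp) {
            this.userId = userId;
            this.timestamp = timestamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {    // serialize
            out.writeLong(userId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException { // deserialize
            userId = in.readLong();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(EventKey o) {   // ordering used in the sort phase
            int c = Long.compare(userId, o.userId);
            return c != 0 ? c : Long.compare(timestamp, o.timestamp);
        }

        @Override
        public int hashCode() {              // used by the default HashPartitioner
            return Long.hashCode(userId) * 31 + Long.hashCode(timestamp);
        }
    }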

43 Component of Map-Reduce Programming Data Types Wrapper classes that implement the WritableComparable interface for each data type: BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, NullWritable. InputFormat The InputFormat abstract class defines how input is presented to the map method. public abstract class InputFormat<K, V> { public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException; public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException; } Big Data Science 43

44 Component of Map-Reduce Programming InputFormat getSplits: computes the input splits that the map tasks will consume. createRecordReader: creates a RecordReader object so that the map method can read the input split in the form of keys and values. The map method executes its analysis logic by reading the key and value from the RecordReader object. Kinds of InputFormat: TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, DelegatingInputFormat, CombineFileInputFormat, SequenceFileInputFormat, SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat Big Data Science 44

45 Component of Map-Reduce Programming Mapper Class Carries out the map function of Map-Reduce programming. It receives input data consisting of keys and values, then processes and classifies the data to create a new data list. Class definition: uses <Input Key Type, Input Value Type, Output Key Type, Output Value Type>. Context object: obtains information about the job and reads the input split as units of records. The RecordReader object is passed to the map method to read data in the form of keys and values. A Map programmer overrides the map method. When the run method is executed, the map method is invoked for every key in the Context object. Big Data Science 45

46 Component of Map-Reduce Programming Mapper Class

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context(Configuration conf, TaskAttemptID taskid,
                       RecordReader<KEYIN, VALUEIN> reader,
                       RecordWriter<KEYOUT, VALUEOUT> writer,
                       OutputCommitter committer, StatusReporter reporter,
                       InputSplit split) throws IOException, InterruptedException {
            super(conf, taskid, reader, writer, committer, reporter, split);
        }
    }

    // Identity map by default; user code overrides this
    protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }
}
Big Data Science 46

47 Component of Map-Reduce Programming Partitioner Decides to which Reduce task the output of a Map task will be passed. Calculates a partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. According to the result, the partition is created at the node where the Map task has been executed, and the output data of the Map task is saved there. When the Map task is completed, the data in the partition are transmitted to the corresponding Reduce task over the network. The getPartition method can be overridden to implement another partitioning strategy. public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> { public int getPartition(K2 key, V2 value, int numReduceTasks) { return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; } } Big Data Science 47
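As a hedged sketch of such a custom strategy in the new (org.apache.hadoop.mapreduce) API; the word@document key layout anticipates the TF-IDF example later, and the class name is hypothetical. This partitioner routes all records of one word to the same reducer:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner: route by the word part of a "word@document" key
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            String word = key.toString().split("@", 2)[0];
            return (word.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class);.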

48 Component of Map-Reduce Programming Reducer The Reducer class receives the output data of the Map tasks as its input data and executes an aggregation operation. Class definition: uses <Input Key Type, Input Value Type, Output Key Type, Output Value Type>. Like the Mapper class, it has a Context object, which inherits ReduceContext: it consults the job for information, receives the input value list as a RawKeyValueIterator, and takes a RecordWriter as a parameter so that the output of the reduce method can be written as units of records. A Reducer programmer overrides the reduce method. Big Data Science 48

49 Component of Map-Reduce Programming Reducer Class

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    public class Context extends ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        public Context(Configuration conf, TaskAttemptID taskid,
                       RawKeyValueIterator input,
                       Counter inputKeyCounter, Counter inputValueCounter,
                       RecordWriter<KEYOUT, VALUEOUT> output,
                       OutputCommitter committer, StatusReporter reporter,
                       RawComparator<KEYIN> comparator,
                       Class<KEYIN> keyClass, Class<VALUEIN> valueClass)
                throws IOException, InterruptedException {
            super(conf, taskid, input, inputKeyCounter, inputValueCounter,
                  output, committer, reporter, comparator, keyClass, valueClass);
        }
    }

    // Identity reduce by default; user code overrides this
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
        for (VALUEIN value : values) {
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
    }

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
        cleanup(context);
    }
} // end of class Reducer
Big Data Science 49

50 Component of Map-Reduce Programming Combiner Class Shuffle: the transfer process between the Map tasks and the Reduce tasks. Reducing the amount of data sent over the network improves the performance of the whole job. The Combiner class takes the output data of the Mapper as its input and creates reduced (locally aggregated) data on the local node before sending it over the network. The Reducer should produce the same output in both cases: with the Combiner and without it. Big Data Science 50
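For an aggregation like word count, the reducer can double as the combiner, because summing partial counts is associative and commutative. A minimal driver sketch (an illustrative addition reusing the WordCount classes from the earlier slide):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CombinerWiring {
        public static Job configure(Configuration conf) throws Exception {
            Job job = new Job(conf, "wordcount-with-combiner");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCount.Map.class);
            job.setCombinerClass(WordCount.Reduce.class); // partial sums on the map side
            job.setReducerClass(WordCount.Reduce.class);  // final sums on the reduce side
            return job;
        }
    }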

51 Component of Map-Reduce Programming OutputFormat The output data format of Map-Reduce is selected with the setOutputFormatClass method of the Job interface. An output data format is made by inheriting the abstract class OutputFormat. Kinds of OutputFormat: TextOutputFormat, SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, FilterOutputFormat, LazyOutputFormat, NullOutputFormat. Most of these classes inherit the FileOutputFormat class, which inherits the OutputFormat class. Big Data Science 51
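A brief illustrative sketch (an addition, not from the slides) of selecting a non-default output format in the driver, here a binary SequenceFile for <Text, IntWritable> results:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class OutputFormatExample {
        // Write results as a binary SequenceFile instead of plain text
        public static void useSequenceFileOutput(Job job) {
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
        }
    }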

52 Document Frequency But document frequency (df) may be better: df = number of docs in a corpus containing the term. (Table: collection frequency cf vs. document frequency df for the sample terms pandemic and influenza.) Document/collection frequency weighting is only possible in a known (static) collection. So how do we make use of df? Big Data Science 52

53 TF IDF term weights The TF IDF measure combines: Term Frequency (TF), or wf, some measure of term density in a doc. Inverse Document Frequency (IDF), a measure of the informativeness of a term: its rarity across the whole corpus. It could just be the inverse of the raw count of documents the term occurs in (IDF_i = 1/DF_i), but by far the most commonly used version is: IDF_i = log(n/DF_i), where n is the number of documents in the corpus. Big Data Science 53
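A worked example (illustrative numbers, not from the slides), taking base-10 logarithms in a corpus of n = 1,000,000 documents:

    \mathrm{idf}_{\text{rare}} = \log_{10}\frac{1{,}000{,}000}{100} = 4,
    \qquad
    \mathrm{idf}_{\text{common}} = \log_{10}\frac{1{,}000{,}000}{1{,}000{,}000} = 0

So a term occurring in only 100 documents contributes a weight of 4 per occurrence, while a term occurring in every document contributes nothing.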

54 Summary : TF IDF (or tf.idf) Assign a tf.idf weight to each term i in each document d: w_(i,d) = tf_(i,d) * log(n/df_i). The weight Increases with the number of occurrences within a doc Increases with the rarity of the term across the whole corpus Big Data Science 54

55 HADOOP MapReduce Program : Calculating TF-IDF Function: Calculating TF-IDF using Map-Reduce on Hadoop Input: words in documents // set of words that are split Output: TF-IDF for each word //category of words set 1st Map function: for each word, output <word@document, 1> 1st Reduce function: for each word@document, n = number of occurrences of the word in the document; output <word@document, n> 2nd Map function: for each <word@document, n>, output <document, word=n> 2nd Reduce function: for each document, N = total number of words in the document; for each word, output <word@document, n/N> 3rd Map function: for each <word@document, n/N>, output <word, document=n/N> 3rd Reduce function: for each word, D = total number of documents, d = number of documents containing the word; calculate the TF-IDF value using n, N, d, and D; output <word@document, TF-IDF value> Big Data Science 55
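A hedged Java sketch of the first pass (an illustrative addition; it assumes one input file per document and takes the document name from the input split — both assumptions, not part of the slide):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class TfPass1 {
        // 1st Map: emit <word@document, 1> for every token
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text compositeKey = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assumption: the document name is the input file name
                String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    compositeKey.set(it.nextToken().toLowerCase() + "@" + doc);
                    context.write(compositeKey, one);
                }
            }
        }

        // 1st Reduce: sum the counts, producing <word@document, n>
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int n = 0;
                for (IntWritable v : values) n += v.get();
                context.write(key, new IntWritable(n));
            }
        }
    }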

56 Hadoop Eco System Source: Big Data Science 56

57 Hadoop Eco System The Hadoop HDFS and Map-Reduce functions are useful for big data processing, but they target Java developers and provide only low-level processing. The ecosystem makes Hadoop more usable and accessible to non-specialists in a user-friendly manner. Its projects are not all meant to be used together as parts of a single organism; some may even be seeking to solve the same problem in different ways. Big Data Science 57

58 YARN Yet Another Resource Negotiator (YARN) addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service. YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and per-application ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN distributes this responsibility across the cluster. Big Data Science 58

59 Hadoop Eco System Avro A framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. BigTop A project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but has since become its own project at Apache. Chukwa A data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop. Big Data Science 59

60 Hadoop Eco System Drill A distributed system for executing interactive analysis over large-scale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more. Flume A tool for harvesting, aggregating and moving large amounts of log data in and out of Hadoop. HBase Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. Big Data Science 60

61 Hadoop Eco System HCatalog A metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services such as MapReduce and Pig, with plans to expand to HBase, using a common data model. Hive Provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Oozie A job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. Big Data Science 61

62 Hadoop Eco System Mahout Mahout is a scalable machine-learning and data mining library. There are currently four main groups of algorithms in Mahout: recommendations, a.k.a. collaborative filtering classification, a.k.a. categorization clustering frequent itemset mining, a.k.a. parallel frequent pattern mining Pig Framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce. Big Data Science 62

63 Hadoop Eco System Sqoop Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase. ZooKeeper A service for maintaining configuration information, naming, providing distributed synchronization and providing group services. Spark (UC Berkeley) A parallel computing program which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMPLab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining." Big Data Science 63

64 Apache Pig High-Level Data Flow Language Made of two components: Data processing language Pig Latin Compiler to translate Pig Latin to MapReduce It abstracts a programmer from specific details and allows them to focus on data processing Big Data Science 64

65 Pig in the Hadoop Ecosystem Source: sigmanac.com Big Data Science 65

66 Pig Latin

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';

Big Data Science 66

67 Apache Hive: SQL for Hadoop A data warehousing layer on top of Hadoop Allows analysis and queries using a SQL-like language Best suited for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis Source: sigmanac.com Big Data Science 67

68 Hive Architecture Big Data Science 68

69 Hive Example

CREATE TABLE users (name STRING, age INT);
CREATE TABLE pages (user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
SORT BY clicks DESC
LIMIT 10;

Big Data Science 69

70 Mahout in Hadoop Copy source: Big Data Science Copyright: NTT Data Co. 70

71 Mahout Package What does Mahout provide: Several Data Mining Algorithms Classification Clustering Association Analysis Recommendations Others 71

72 Classification Assigning objects to one of several predefined categories: Politics, economics, or sports sections of Reuters news articles Types of galaxy images Classifying credit card transactions as legitimate or fraudulent Classifying code as malicious or not Algorithms Naïve Bayes Decision Tree SVM Linear Regression Neural Network Rule-based Methods 72

73 Clustering Grouping objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Partitional Clustering / Hierarchical Clustering Algorithms: K-Means, Fuzzy K-Means, Density-Based, ... Different distance measures: Manhattan, Euclidean, other domain-dependent measures 73

74 Association Analysis Find the frequent item sets in a transaction DB <milk, bread, cheese> are sold frequently together Apriori principle Market analysis, access pattern analysis, etc.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

74

75 Recommendation Two Representative Types of Recommendation Content-based recommendation Collaborative filtering (by ratings of similar users) Trust-based recommendation Online and Offline Support Several Similarity Measures Cosine, LLR, Pearson correlation 75

76 Others Outlier detection Math library Vectors, matrices, etc. Noise reduction 76

77 Mahout Example: K-Means Clustering Big data application support: Map-Reduce programming, Pig, Hive, etc. Some algorithms, such as collaborative filtering, clustering, classification, and association, are difficult to implement directly. Example of K-Means Clustering Big Data Science 77

78 Mahout Example: K-Means Clustering Short Introduction to Installation and Clustering of the Human Development Report Data Set Mahout Source Download: Extract the files and set the MAHOUT_HOME variable. Getting the data set: ol/synthetic_control. Execution of the K-Means example: bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job Big Data Science 78

79 Initializing the K-Means Algorithm

public static void main(String[] args) throws Exception {
    Path output = new Path("output");
    Configuration conf = new Configuration();
    HadoopUtil.delete(conf, output);
    run(conf, new Path("testdata"), output, new EuclideanDistanceMeasure(), 6, 0.5, 10);
}

public int run(String[] args) throws Exception {
    addInputOption();
    addOutputOption();
    addOption(DefaultOptionCreator.distanceMeasureOption().create());
    addOption(DefaultOptionCreator.numClustersOption().create());
    addOption(DefaultOptionCreator.t1Option().create());
    addOption(DefaultOptionCreator.t2Option().create());
    addOption(DefaultOptionCreator.convergenceOption().create());
    addOption(DefaultOptionCreator.maxIterationsOption().create());
    addOption(DefaultOptionCreator.overwriteOption().create());
    Map<String, String> argMap = parseArguments(args);
    if (argMap == null) {
        return -1;
    }
    Path input = getInputPath();
    Path output = getOutputPath();
    String measureClass = getOption(DefaultOptionCreator.DISTANCE_MEASURE_OPTION);
    if (measureClass == null) {
        measureClass = SquaredEuclideanDistanceMeasure.class.getName();
    }
    double convergenceDelta = Double.parseDouble(getOption(DefaultOptionCreator.CONVERGENCE_DELTA_OPTION));
    int maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
    if (hasOption(DefaultOptionCreator.OVERWRITE_OPTION)) {
        HadoopUtil.delete(getConf(), output);
    }
    DistanceMeasure measure = ClassUtils.instantiateAs(measureClass, DistanceMeasure.class);
    if (hasOption(DefaultOptionCreator.NUM_CLUSTERS_OPTION)) {
        int k = Integer.parseInt(getOption(DefaultOptionCreator.NUM_CLUSTERS_OPTION));
        run(getConf(), input, output, measure, k, convergenceDelta, maxIterations);
    } else {
        double t1 = Double.parseDouble(getOption(DefaultOptionCreator.T1_OPTION));
        double t2 = Double.parseDouble(getOption(DefaultOptionCreator.T2_OPTION));
        run(getConf(), input, output, measure, t1, t2, convergenceDelta, maxIterations);
    }
    return 0;
}
Big Data Science 79

80 Setting Up the K-Means Algorithm

public static void run(Configuration conf, Path input, Path output,
        DistanceMeasure measure, int k, double convergenceDelta,
        int maxIterations) throws Exception {
    Path directoryContainingConvertedInput =
        new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);
    log.info("Preparing Input");
    InputDriver.runJob(input, directoryContainingConvertedInput,
        "org.apache.mahout.math.RandomAccessSparseVector");
    log.info("Running random seed to get initial clusters");
    Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
    clusters = RandomSeedGenerator.buildRandom(conf,
        directoryContainingConvertedInput, clusters, k, measure);
    log.info("Running KMeans");
    KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output,
        measure, convergenceDelta, maxIterations, true, false);
    // run ClusterDumper
    ClusterDumper clusterDumper = new ClusterDumper(
        finalClusterPath(conf, output, maxIterations),
        new Path(output, "clusteredPoints"));
    clusterDumper.printClusters(null);
}
Big Data Science 80

81 Example of Visualization of Clustering Big Data Science 81

82 Research Issues on Big Data Infra-Structure and Application Design of Efficient Map-Reduce for Important Algorithms: Counting (Word Count/TF-IDF)/Sort Several Data Mining Algorithms (Classification/Clustering/Association) Operations on Graphs/Networks Other Applications Application of Hadoop Map-Reduce to Several Domains Web/SNS data/Bio/Sensors/IoT Many Domains: Triple Engine for WoD / SA on Big Data Big Data Science 82

83 Situation Awareness on Big Data Big Data Science

84 Three layers for situation awareness (Diagram) Perception layer: Integration / Data Mining / Machine Learning, producing Processed Information Entity Metadata; Comprehension layer: Ontology Learning, producing Learned Relationship Metadata; Projection layer: Reasoning / Prediction, producing Inferred Metadata, New Facts / Rules, and Semantics / Understanding / Insight. Underneath: Information Services for Data from Web/Smart Grid/IoT/Cloud (World) Data. Big Data Science

85 Layers for Situation Awareness on Smart Grid Services Projection Layer / Knowledge Extraction, Inference Comprehension Layer / Ontology Learning, Inference Perception Layer / Data Mining, Signal Processing Web Services / SOAP, RESTful Services Smart Grid Network / Sensors, Internet of Things Big Data Science

86 Active Situation Awareness (Diagram) Projection: recommendToParticipateTheEvent(Building, Event), needReplyTo(ITM), checkHisEvent(ITM) Comprehension (Situation): giveHotTopic(ITM, ATopicHisBlog), hasEvent(Building, Event), isRare(Event), sayCelebration(ITM, MyBlog) Perception: Stand(People, LongLine), isAt(People, Building), Wrote(ITM, MyBlog), needReplyTo(ITM) World: Facebook, Twitter, Google, Web Data Service 86 Big Data Science

87 A Framework for ASA on Big Data Big data infrastructure based on Hadoop and data mining algorithms on top of it. Consistent Web APIs for the ASA upper layer. Big Data Infrastructure by Hadoop - What is the data format of SG? - How can we get data from SG? - How do we map between SG and Hive? - How can we get data from Hive? - How do we map between Hive and Services? Smart Grid Infrastructure Big Data Science

88 Big Data Infrastructure To ASA Perception Layer Big Data Reasoner ETL Query ETL Map- Reduce Engine Events with Time and Location (ETL) Identification & Collector Big Data Science Big Data from SNS/Web/Sensors

89 Big Data Infrastructure The <K,V> schemes for Event-Time-Location are S1: <E, <T,L>> S2: <T, <E,L>> S3: <L, <E,T>> where E is event, T time, and L location. Map-Reduce Function for Event, Time and Location Function: Retrieving E, T, L using Map-Reduce on Hadoop Input: SNS data set // set of words that are split Output: E/T/L related to an event //category of words set Map function for Event: for each word, output <event@input, <T,L> pair> Reduce function for Event: for each event@input, reason over the <T,L> pair set using temporal and spatial relationships; output the reasoned <T,L> pair set - The Map-Reduce functions for the other elements can be obtained in the same fashion Big Data Science
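A hedged sketch of the S1 (<E, <T,L>>) map function, assuming SNS records already parsed into event/time/location fields separated by tabs; the class name and field layout are hypothetical, not part of the slide:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // S1 scheme: key = event E, value = <time T, location L>
    public class EventTimeLocationMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text event = new Text();
        private final Text timeLocation = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical input record: event \t time \t location
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) return;          // skip malformed records
            event.set(fields[0]);
            timeLocation.set(fields[1] + "," + fields[2]);
            context.write(event, timeLocation);     // emit <E, <T,L>>
        }
    }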

90 Research Issues on Big Data Infra-Structure and Application Optimization of Big Data Infrastructure Locality/Network Aware Big Data Infrastructure Pipelining of Partitioning/Shuffling of Map-Reduce Considering Map-Oriented/Reduce-Oriented Operation Block & Job Scheduling for Optimization Big Data Science 90

91 Network-Aware Optimal Data Allocation for Map-Reduce Big Data Science

92 Why heavy network load? Data movement costs in the map phase: if the grouping data are distributed by Hadoop's random strategy, the shaded map tasks with either remote data access or queueing delay are the performance barriers; whereas if these data are evenly distributed, the MapReduce program can avoid these barriers. Intermediate data shuffling cost across racks in the reduce phase. TTj stands for a Task Tracker node j. P_{Ri:TTj} stands for a partition P produced at TTj and hashed to reducer Ri. Shuffling moves P_{R0:TT0} from TT0 to TT3, P_{R1:TT2} from TT2 to TT0, and P_{R1:TT3} from TT3 to TT0. In addition, P_{R1:TT1} will be shuffled from TT1 to TT0, and P_{R0:TT3} will remain local to TT3. Network load: off-rack > rack-local > node-local Big Data Science

93 Example of Minimizing Cost by Data Replacement Big Data Science

94 Research Problem The research problem is: to develop the mathematical models and theories, and to develop learning algorithms, for the optimization of the objective function defined in Eq. (1) with constraints (2) and (3). Big Data Science

95 Optimization Example Intuitively: Replicas of each data block are distributed across the main server node clusters. Data blocks that share high similarity will be placed closer together. Big Data Science

96 References Tom White, Hadoop: The Definitive Guide, O'Reilly, 2011 Srinath Perera, Thilina Gunarathne, Hadoop Map-Reduce Programming, Packt Publishing, 2013 J.H. Jeong, Beginning Hadoop Programming: Development and Operations, Wiki Books, 2012 Big Data University, I. Paik, R. Sawa, K. Ofuji, and N. Yen, Introduction to Big Data Science, Graduate School Course, Big Data Science 96


MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Big Data Workshop. dattamsha.com

Big Data Workshop. dattamsha.com Big Data Workshop About Praveen Has more than15 years of experience working on various technologies. Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

Hadoop Basics with InfoSphere BigInsights

Hadoop Basics with InfoSphere BigInsights An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 2: Using MapReduce An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted Rights

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Google Bing Daytona Microsoft Research

Google Bing Daytona Microsoft Research Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large

More information

Cloud Computing Era. Trend Micro

Cloud Computing Era. Trend Micro Cloud Computing Era Trend Micro Three Major Trends to Chang the World Cloud Computing Big Data Mobile 什 麼 是 雲 端 運 算? 美 國 國 家 標 準 技 術 研 究 所 (NIST) 的 定 義 : Essential Characteristics Service Models Deployment

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce

More information

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

Hadoop: The Definitive Guide

Hadoop: The Definitive Guide FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Apache Hadoop: Past, Present, and Future

Apache Hadoop: Past, Present, and Future The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past

More information

HDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 "Hadoop and HDInsight in a Heartbeat"

HDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 Hadoop and HDInsight in a Heartbeat HDInsight Essentials Rajesh Nadipalli Chapter No. 1 "Hadoop and HDInsight in a Heartbeat" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Hadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ.

Hadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ. Hadoop MapReduce: Review Spring 2015, X. Zhang Fordham Univ. Outline 1.Review of how map reduce works: the HDFS, Yarn sorting and shuffling advanced topics: partial sort, total sort, join, chained mapper/reducer,

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop

More information

Tutorial- Counting Words in File(s) using MapReduce

Tutorial- Counting Words in File(s) using MapReduce Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Hadoop (Hands On) Irene Finocchi and Emanuele Fusco

Hadoop (Hands On) Irene Finocchi and Emanuele Fusco Hadoop (Hands On) Irene Finocchi and Emanuele Fusco Big Data Computing March 23, 2015. Master s Degree in Computer Science Academic Year 2014-2015, spring semester I.Finocchi and E.Fusco Hadoop (Hands

More information

map/reduce connected components

map/reduce connected components 1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains

More information

Hadoop: Understanding the Big Data Processing Method

Hadoop: Understanding the Big Data Processing Method Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

BIG DATA HADOOP TRAINING

BIG DATA HADOOP TRAINING BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Hadoop Configuration and First Examples

Hadoop Configuration and First Examples Hadoop Configuration and First Examples Big Data 2015 Hadoop Configuration In the bash_profile export all needed environment variables Hadoop Configuration Allow remote login Hadoop Configuration Download

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information