Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Big Data Science: Fundamental, Techniques, and Challenges (HDFS & Map-Reduce)
|
|
- Terence Wright
- 8 years ago
- Views:
Transcription
1 Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Big Data Science: Fundamental, Techniques, and Challenges (HDFS & Map-Reduce) Incheon Paik University of Aizu, Japan Big Data Science 1
2 Contents What is Big Data? Big Data Source Clustered File System & Distributed File System Hadoop Distributed File System (HDFS) Map-Reduce Operation Application of Map-Reduce Big Data Science 2
3 Overview of Big Data Science What is Big Data? Recently, there have been very large and complex data sets from nature, sensors, social networks, enterprises increasingly based on high speed computers and networks together. Big data is the term for a collection of the data sets that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big Data Science 3
4 Overview of Big Data Science Several Areas Meteorology Genomics Connectomics Complex physics simulations Biological and environmental research Internet search Finance and business informatics Data sets ubiquitous information-sensing mobile devices Aerial sensory technologies (remote sensing) Software logs Cameras Microphones Radio-frequency identification readers Wireless sensor networks Big Data Science 4
5 Clustered File System A file system which is shared by being simultaneously mounted on multiple servers. Clustered file systems can provide features like Location-independent addressing Redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance. Big Data Science 5
6 Clustered File System (DAS/NAS/SAN) Source: Big Data Science 6
7 Fundamental Differences in DAS/NAS/SAN Source: he.wikipedia.org Big Data Science 7
8 Storage Area Network Source: pccgroup.com Big Data Science 8
9 Shared-Disk / Storage Area Network A shared-disk file-system uses a storage-area network (SAN) to provide direct disk access from multiple computers at the block level. Access control and translation from file-level operations (application) to block-level operations (by SAN) must take place on the client node. Adds a mechanism for concurrency control which gives a consistent and serializable view of the file system, avoiding corruption and unintended data loss Usually employ some sort of a fencing mechanism to prevent data corruption in case of node failures. Big Data Science 9
10 Shared-Disk / Storage Area Network The SAN may use any of a number of block-level protocols, including SCSI, iscsi, HyperSCSI, ATA over Ethernet, Fibre Channel, etc. There are different architectural approaches to a shared-disk file-system. Distribute file information across all the servers in a cluster (fully distributed). Utilize a centralized metadata server. Both achieve the same result of enabling all servers to access all the data on a shared storage device. Big Data Science 10
11 Network Attached Storage (NAS) Source: Big Data Science 11
12 Network Attached Storage (NAS) Provides both storage and a file system, like a shared disk file system on top of a SAN. Typically uses file-based protocols (as opposed to block-based protocols a SAN would use) such as NFS, SMB/CIFS, AFP, or NCP. Design Considerations Avoiding single point of failure: Fault tolerance and high availability by data replication of one sort or another Performance: fast disk-access time and small amount of CPU-processing time over distributed structure Concurrency: for consistent and efficient multiple accesses to the same file or block by concurrency control or locking which may either be built into the file system or provided by an add-on protocol Big Data Science 12
13 NAS vs. SAN Source: turbotekcomputer.com Big Data Science 13
14 Distributed File System Distributed file systems do not share block level access to the same storage but use a network protocol. These are commonly known as network file systems, even though they are not the only file systems that use the network to send data. Distributed file systems can restrict access to the file system depending on access lists or capabilities on both the servers and the clients, depending on how the protocol is designed. Big Data Science 14
15 Distributed File System The difference between a distributed file system and a distributed data store Distributed file system (DFS) allows files to be accessed using the same interfaces and semantics as local files - e.g. mounting/unmounting, listing directories, read/write at byte boundaries, system's native permission model. Distributed data stores, by contrast, require using a different API or library and have different semantics (most often those of a database). Big Data Science 15
16 Distributed File System Design Goal of DFS Access transparency: Clients are unaware that files are distributed and can access them in the same way as local files are accessed. Location transparency: A consistent name space exists encompassing local as well as remote files. The name of a file does not give its location. Concurrency transparency: All clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner. Big Data Science 16
17 Distributed File System Design Goal of DFS Failure transparency: The client and client programs should operate correctly after a server failure. Heterogeneity: File service should be provided across different hardware and operating system platforms. Scalability: The file system should work well in small environments (1 machine, a dozen machines) and also scale gracefully to huge ones (hundreds through tens of thousands of systems). Replication transparency: To support scalability, we may wish to replicate files across multiple servers. Clients should be unaware of this. Migration transparency: Files should be able to move around without the client's knowledge. Big Data Science 17
18 Examples Distributed File System GFS (Google Inc.) Ceph (Inktank) Windows Distributed File System (DFS) (Microsoft) FhGFS (Fraunhofer) GlusterFS (Red Hat) Lustre Ibrix Big Data Science 18
19 Hadoop Distributed File System (HDFS) The existing file system such as large file storage, NAS or SAN is expensive. It needs very high performance server. HDFS can combine Web server level hosts into a disk storage. Some high performance large storage system will be needed. SAN is suitable for DBMS, NAS for safe file store. It can process big data concurrently using Map-Reduce. Four Fundamental Objectives of HDFS Error recovery Access data using streaming Large data storage Data integrity Big Data Science 19
20 Hadoop Distributed File System Big Data Science 20
21 Block Structured File System File Copy Structure on HDFS 320 MB File Block 1 Block 2 Block 3 Block 4 Block 5 HDFS Block 1 Block 3 Block 4 Block 2 Block 3 Block 4 Block 1 Block 3 Block 5 Block 2 Block 4 Block 5 Block 1 Block 2 Block 5 Big Data Science 21
22 Block Structured File System Source:bradhedlund.com Big Data Science 22
23 Block Structured File System Source:bradhedlund.com Big Data Science 23
24 NoSQL HDFS,NoSQL, Map-Reduce No SQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are often highly optimized key value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput. Big Data Science 24
25 Map Operation What is a Map operation Doing something to every element in an array is a common operation. The operation can be considered as Map function. int a = {1,2,3}; for (I = 0; I < a.length; i++) a[i] = a[i] * 2; This can be written as a function. New value for the variable a would be int a = {2,4,6}; Map-Reduce Operation Source:bigdatauniversity.com Big Data Science 25
26 Map Operation What is a Map operation Doing something to every element in an array is a common operation. The operation can be considered as Map function. int a = {1,2,3}; for (I = 0; I < a.length; i++) a[i] = f x (a[i]); All of this can also be converted into a map function. New value for the variable a would be int a = {2,4,6}; Like this, where f x is a function defined as : function f x (x) {return x * 2 ;} Big Data Science 26
27 Map Operation What is a Map operation like this, where f x is a function passed as an argument: function map (f x, a) { } for (I = 0; I < a.length; i++) a[i] = f x (a[i]); You can invoke this map function as follows map(function(x) {return x*2; }, a); This is function f x whose defintion is included in the call. Big Data Science 27
28 Reduce Operation What is a Reduce operation Another common operation on arrays is to combine all their value: function sum(a) { int s = 0; for (i = 0; I < a.length; i++) s += a[i] ; return s; } This can be written as a function. Big Data Science 28
29 Reduce Operation What is a Reduce operation Another common operation on arrays is to combine all their values: function sum(a) { int s = 0; for (i = 0; I < a.length; i++) s = f x (s, a[i]) ; return s; } Like this, where a function f x is defined so that it adds its arguments: function f x (a,b) {return a+b ;} The whole function sum can also be rewritten so that f x is passed as an argument. as a reduce operation: reduce(function(a,b) {return a+b; }, a, 0); Big Data Science 29
30 Submitting a MapReduce Job Source:bigdatauniversity.com Big Data Science 30
31 Shuffle Process 1. Split Creation 2. Map 3. Spill 4. Merge 5. Copy 6. Sort 7. Reduce Input Split Map1 Reduce1 Input Split Map2 Reduce2 Input Split Map3 Memory Buffer Partition Map Output Data Memory Buffer, File File Source: Beginning Hadoop Programming Big Data Science 31
32 MapReduce Type and Data Flow Lists Source:bigdatauniversity.com Big Data Science 32
33 HADOOP MapReduce Program : WordCount.java import java.io.ioexception; import java.util.stringtokenizer; import org.apache.hadoop.util.genericoptionsparser; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.longwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.mapper; import org.apache.hadoop.mapreduce.reducer; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.input.textinputformat; import org.apache.hadoop.mapreduce.lib.output.fileoutputformat; import org.apache.hadoop.mapreduce.lib.output.textoutputformat; public class WordCount { // Mapper Implementation public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new protected void map(longwritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.tostring(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasmoretokens()) { word.set(tokenizer.nexttoken()); context.write(word, one); } } } // Reducer Implementation public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable value = new protected void reduce(text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable value : values) sum += value.get(); value.set(sum); context.write(key, value); } } public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); // Parameter Passing GenericOptionsParser parser = new GenericOptionsParser(conf, args); args = parser.getremainingargs(); // Creating of Job Job job = new Job(conf, "wordcount"); job.setjarbyclass(wordcount.class); // Assign Classes for Mapper/Reducer job.setmapperclass(map.class); job.setreducerclass(reduce.class); // Other Setting for Job job.setmapoutputkeyclass(text.class); job.setmapoutputvalueclass(intwritable.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); Big Data Science 33
34 Map-Reduce Architecture on Hadoop Client Job Tracker Data Node Task Tracker Data Node Task Tracker Data Node Task Tracker Map Map Map Reduce Reduce Reduce Map Map Map Reduce Reduce Reduce Map Map Map Reduce Reduce Reduce Source: Beginning Hadoop Programming Big Data Science 34
35 Map-Reduce Architecture on Hadoop Client: Map-Reduce API provided by Hadoop and Map-Reduce Program executed by a user. Job Tracker A Map-Reducer program is managed as a work unit, Job. Manage scheduling and monitoring of entire jobs on Hadoop cluster. Running on a Name Node server usually, but necessarily. Description of Job Tracker When a user requests a new job, it calculates how many Maps and Reduces will be executed for the job. Big Data Science 35
36 Map-Reduce Architecture on Hadoop Description of Job Tracker A Map-Reduce program that executed by a client through Hadoop is managed by the unit of Job. The Job Tracker monitors scheduling of all jobs registered on Hadoop cluster. But we don t have to execute the job tracker on name node server. Decide Task Trackers where the Maps-Reduces will be executed and assign the job to the Task Trackers decided. -> Then the task trackers execute the Map- Reduce programs. The job tracker communicates with task trackers with heart-beats for checking status of task trackers and job execution information. Big Data Science 36
37 Map-Reduce Architecture on Hadoop Task Tracker A demon program that executes Map-Reduce program of a user on a Data Node. Creates Map-Tasks and Reduce-Tasks that the Job Tracker requested. Execute the tasks (Map and Reduces) by running new JVM. Workflow of Map-Reduce A user executes a job by invoking waitforcompletion method. A job client object is created in the job interface, and the object access to the Job Tracker to execute the job. Big Data Science 37
38 Name Node Map-Reduce Operation on Hadoop Data Node Client 1. Map-Reduce Job Execute Job Client Input Data 2. New Job Submit 3. Input Split Creation Job Tracker 4. Job Assignment Task Tracker Mapper Mapper 5. Mapper Execution Task Task Job Task Task Task Partitioner Key-Value Key-Value 6. Sort and Merge Reducer Reducer 7. Reducer Execution Key-Value Key-Value Output Data 8. Output Data Save Source: Beginning Hadoop Programming Big Data Science 38
39 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce Then, the Job Tracker returns a new job ID to the Job Client. And the Job Client checks the output path defined by the user. The Job Client calculates input split (Size of input file to be processed in a Map procedure) about the input data of the job. After the calculation of the input split, it saves input split information, setting file, and Map- Reduce JAR file to HDFS, and notifies completion of preparation for starting the job. The Job Tracker registers the job on the Queue, Job scheduler initializes the job from the Queue. And then, the Job Scheduler creates Map tasks according to the input split information and assign IDs. Big Data Science 39
40 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce The Task Tracker gives its status to the Job Tracker periodically calling heartbeat method. In heartbeat information, there are resource information about CPU, memory, disk, number of tasks, and available maximum tasks, new tasks execution possibilities. The Task Tracker executes the Map task assigned by the Job Tracker. The Map task executes the logic in the Map method, and save output data to memory buffer. At this time, a patitioner decides a Reduce task to which the output data should be transferred, and assign a suitable partition for it. The data in memory will be sorted by the keys, and saved to local hard disk. When the save is to be completed, the files will be sorted and merged into one file. Big Data Science 40
41 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce A reduce task is the Reducer class written by a user. The task can be executed when all output data are to be prepared. When the Map task completes output, it notifies its completion to the Task Tracker that ran itself. When the Task Tracker receives the message, it notifies status of the corresponding Map task and output data path of the Map task to the Job Tracker. The Reduce task copies output data of all Map tasks, and merges output data of the Map tasks. When the merge is to be completed, it executes analysis logic to call Reduce method. The Reduce task saves the output data to HDFS as name of part-nnnnn. Where nnnnn means partition ID. Big Data Science 41
42 Component of Map-Reduce Programming Data Types The Map-Reduce is the optimized object for network communication, and provides WritableComparable interface. All data types for Key and Value in Map-Reduce program should implement the WritableComparable interface. The WritableComparable interface inherits the Writable and Comparable interfaces. The write method serializes data value, and the readfields method unserializes to read the serialized data. public interface Writable { void write (DataOutput out) throws IOException; void readfields (DataInput in) throws IOException; } Big Data Science 42
43 Component of Map-Reduce Programming Data Types The Wrapper class to implement the WritableComparable interface for each data type: BoolleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, TextWritable, NullWritable. InputFormat The InputFormat abstract class let us use it as input parameter of the map method. public abstract class InputFormat(K, V> { public abstract List<InputSplit> getsplits(jobcontext context) throws IOException, InterruptedException; public abstract RecordReader<K, V> createrecordreader(inputsplit split, TaskAttemptContext context) throws IOException, InterruptedException; } Big Data Science 43
44 Component of Map-Reduce Programming InputFormat getsplits : the map can use input split. createrecordreader : creates RecordReader object so that the map method uses input split in the form of key and list. The map method executes analysis logic to read key and value in the RecordReader object. Kinds of the InputFormat: TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, DelegatingInputFormat, CombineFileInputFormat, SequenceFileInputFormat, SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat Big Data Science 44
45 Component of Map-Reduce Programming Mapper Class Carry out function of map method of Map-Reduce programming. It receives input data consisting key and value, then process and classify the data to create new data list. Class definition: using <Input Key Type, Input Value Type, Output Key Type, Output Value Type> Context object: Getting information about job, and read input split as unit of record. The RecordReader object is passed to the map method to read data in the form of key and value. A Map programmer overwrites the map method. Next, when the run method is to be executed, the map method will be executed for all keys in the Context object. Big Data Science 45
46 Component of Map-Reduce Programming Mapper Class public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { public class Context extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { public Context(Configuration conf, TaskAttemptID taskid, RecordReader<KEYIN,VALUEIN> reader, RecordWriter<KEYOUT,VALUEOUT> writer, OutputCommitter committer, StatusReporter reporter, InputSplit split) throws IOException, InterruptedException { super(conf, taskid, reader, writer, committer, reporter, split); } protected void map(keyin key, VALUEIN value, Context context) throws IOException, InterruptedException { context.write((keyout) key, (VALUEOUT) value); } Public void run(context context) throws IOException, InterruptedException { setup (context); while (context.nextkeyvalue()) { map(context.getcurrentkey(), context.getcurrentvalue(), context); } } Big Data Science 46
47 Component of Map-Reduce Programming Partitioner To decide to which Reduce task the data of Map task will be passed. Calculating a partition as key.hashcode() & Integer.MAX_VALUE) % numreducetasks According to the result, the partition will be created at the node that Map task has been executed, and output data of the Map task is saved. When the Map task is completed, the data at the partition will be transmitted to corresponding Reduce task through network. The getpartition method can be overwritten for other partitioning strategy. public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> { public int getpartition(k2 key, V2 value, int numreducetasks) { return (key.hashcode() & Integer.MAX_VALUE) % numreducetasks; } } Big Data Science 47
48 Component of Map-Reduce Programming Reducer The Reducer class receives the output data of the Map task as input data to execute aggregation operation. Class definition: using <Input Key Type, Input Value Type, Output Key Type, Output Value Type> Like Mapper class, the Context object: inherits the ReduceContext. Consult with job for information, getting the information as RawKeyValueIterator for checking input value list. RecordWriter as parameter so that output result of map method can be as unit of record. A Reducer programmer overwrites the reduce method. Big Data Science 48
49 Component of Map-Reduce Programming Reducer Class public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { public class extends ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { public Context(Configuration conf, TaskAttemptID taskid, RawKeyValueIterator input, Counter inputkeycounter, Counter inputvaluecounter, RecordWriter<KEYOUT,VALUEOUT> output, OutputCommitter committer, StatusReporter reporter, RawComparator<KEYIN> comparator, Class<KEYIN> keyclass, Class<VALUEIN> valueclass ) throws IOException, InterruptedException { super(conf, taskid, input, inputkeycounter, inputvaluecounter, output, committer, reporter, comparator, keyclass, valueclass); } } protected void reduce(keyin key, Iterable<VALUEIN> values, Context context ) throws IOException, InterruptedException { for(valuein value: values) { context.write((keyout) key, (VALUEOUT) value); } } public void run(context context) throws IOException, InterruptedException { setup(context); while (context.nextkey()) { reduce(context.getcurrentkey(), context.getvalues(), context); } cleanup(context); } } // end of class Reducer Big Data Science 49
50 Component of Map-Reduce Programming Combiner Class Shuffle: Transportation process between Map task and Reduce task. It needs to reduce the amount of data that will be transferred through network to improve performance of entire jobs. The Combiner class inputs the output data of Mapper and creates reduced data on the local node to send through network. The Reducer should produce the same output in the both case: with the Combiner and without the class. Big Data Science 50
51 Component of Map-Reduce Programming OutputFormat Output data format of Map-Reduce is created by the format of setoutputformatclass method in Job interface. Output data format is made by inheriting the abstract class OutputFormat. Kinds of the OutputFormat: TextOutputFormat, SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, FilterOutputFormat, LazyOutputFormat, NullOutputFormat. The classes inherit the FileOutputFormat class which inherits the OutputFormat class. Big Data Science 51
52 Document Frequency But document frequency (df) may be better: df = number of docs in a corpus containg the term Word cf df pandemic influenza Document/Collection frequency weighting is only possible in known(static) collection. So how do we make use of df? Big Data Science 52
53 TF IDF term weights TF IDF measure combines: Term Frequency (TF) or wf, some measure of term density in a doc Inverse document frequency (IDF) measure of informativeness of a term: its rarity across the whole corpus Could just be raw count of number of documents the term occurs in (IDF i = 1/DF i ) but by far the most commonly used version is: IDF i = log (n/df i ) Big Data Science 53
54 Summary : TF IDF (or tf.idf) Assign a tf.idf weight to each term i in each document d Increases with the number of occurrences within a doc Increases with the rarity of the term across the whole corpus Big Data Science 54
55 HADOOP MapReduce Program : Calculating TF-IDF Function: Calculating TF-IDF using Map-Reduce on Hadoop Input: word in document // set of words that are split Output: TF-IDF for each word //category of words set 1st Map function: for each word output <word@document title,1> 1st Reduce function: for each word@document n = number of each word@document file output file word@documet title, n 2nd Map function for each word@document title output <document title,word=n> 2nd Reduce function for each document N = number of word in all documents for each word output word@document title, n/n 3rd Map function for each word@document output <word,document title=n/n> 3rd Reduce function for each word D = number of word in all documents for each document d = number of word in document calculate TF-IDF value use n, N, d and D output word@documet,tf-idf value } Big Data Science 55
56 Hadoop Eco System Source: Big Data Science 56
57 Hadoop Eco System The Hadoop HDFS and Map-Reduce function are useful for Big data processing, but just for specific Java developers and providing low level processing. The system makes Hadoop more usable and layman-accessible with user friendly manner. It does not mean to all be used together, but as parts of a single organism; some may even be seeking to solve the same problem in different ways. Big Data Science 57
58 YARN Yet Another Resource Negotiator (YARN) addresses problems with MapReduce 1.0 s architecture, specifically with the JobTracker service. YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and perapplication ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster. Big Data Science 58
59 Avro Hadoop Eco System A framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. BigTop A project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but has since become its own project at Apache. Chukwa A data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop. Big Data Science 59
60 Drill Hadoop Eco System A distributed system for executing interactive analysis over largescale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more. Flume A tool for harvesting, aggregating and moving large amounts of log data in and out of Hadoop. HBase Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. Big Data Science 60
61 Hadoop Eco System HCatalog A metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services such as MapReduce and Pig with plans to expand to HBase using a common data model. Hive Provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Oozie A job coordinator and workflow manager for jobs executed in Hadoop, which can include non-mapreduce jobs. Big Data Science 61
62 Hadoop Eco System Mahout Pig Mahout is a scalable machine-learning and data mining library. There are currently four main groups of algorithms in Mahout: recommendations, a.k.a. collective filtering classification, a.k.a categorization clustering frequent itemset mining, a.k.a parallel frequent pattern mining Framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce. Big Data Science 62
63 Sqoop Hadoop Eco System Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase. ZooKeeper A service for maintaining configuration information, naming, providing distributed synchronization and providing group services. Spark (UC Berkeley) A parallel computing program which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMPLab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining." Big Data Science 63
64 Apache Pig High-Level Data Flow Language Made of two components: Data processing language Pig Latin Complier to translate Pig Latin to MapReduce It abstracts a programmer from specific details and allows to focus on data processing Big Data Science 64
65 Pig in the Hadoop Ecosystem Source: sigmanac.com Big Data Science 65
66 Pig Latin Pig Latin users = LOAD users.text USING PigStorage (, ) AS (name, age); pages = LOAD pages.txt USING PigStorage (, ) AS (user, url); filteredusers = FILTER users BY age >= 18 and age <= 50; joinresult = JOIN filteredusers BY name, ages by user; grouped = GROUP joinresult by url; summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks; sorted = ORDER summed BY clicks desc; top10 = LIMIT sorted 10; STORE top10 INTO top10sites ; Big Data Science 66
67 Apache Hive: SQL for Hadoop Data Warehousing Layer on top of Hadoop Allows Analysis and queries using a SQL-like language It is the best for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis Source: sigmanac.com Big Data Science 67
68 Hive Architecture Big Data Science 68
69 Example Hive Example CREATE TABLE users (name STRING, age INT); CREATE TABLE pages (user STRING, url STRING); LOAD DATA INPATH /user/sandbox/users.txt INTO TABLE users ; LOAD DATA INPATH /user/sandbox/pages.txt INTO TABLE pages ; SELECT pages.url, count(*) AS clicks FROM users JOIN pages ON (users.name = pages.user) WHERE users.age >= 18 AND users.age <= 50 GROUP BY pages.url SORT By clicks DESC LIMIT 10; Big Data Science 69
70 Mahout in Hadoop Data Warehousing Layer on top of Hadoop Allows Analysis and queries using a SQL-like language It is the best for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis Copy source: Big Data Science Copyright: NTT Data Co. 70
71 Mahout Package What does Mahout provide: Several Data Mining Algorithms Classification Clustering Association Analysis Recommendations Others 71
72 Classification Assigning some objects to one of several predefined categories: Politics, Economics, Sports of Reuter news paper Some figures of galaxies Classifying credit card transactions as legitimate or fraudulent Classify malicious code or ot Algorithms Naïve Bayes Decision Tree SVM Linear Regression Neural Network Rule-based Methods 72
73 Clustering Grouping objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Partitional Clustering / Hierarchical Clustering K-Means, Fuzzy K-Means, Density- Based, Different distance measures Manhattan, Euclidean, Other domain dependent methods 73
74 Association Analysis Find the frequent item sets in transaction DB <milk, bread, cheese> are sold frequently together Apriori principle Market analysis, access pattern analysis, etc TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke 74
75 Recommendation Two Representative Types of Recommendation Contents Based Recommendation Collaborative Filtering (By rating of similar users) Trust-Based Recommendation Online and Offline Support Several Similarity Measures Cosine, LLR, Pearson Correlation 75
76 Others Outlier detection Math library Vectors, matrices, etc. Noise reduction 76
77 Mahout Example: K-Mean Clustering Big data application supports: Ma-Reduce programming, Pig, Hive, etc Some algorithms such as collaboration filtering, clustering, classification, association is difficult to implement. Example of K-Mean Clustering Big Data Science 77
78 Mahout Example: K-Mean Clustering Short Introduction to Installation and Clustering of Human Development Report Data Set Mahout Source Download: Extracting files and set MAHOUT_HOME variable. Getting data set: ol/synthetic_control. Execution of K-Mean example bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.job Big Data Science 78
79 Initializing K-Mean Algorithm public static void main(string[] args) throws Exception { Path output = new Path("output"); Configuration conf = new Configuration(); HadoopUtil.delete(conf, output); run(conf, new Path("testdata"), output, new EuclideanDistanceMeasure(), 6, 0.5, 10); public int run(string[] args) throws Exception{ addinputoption(); addoutputoption(); addoption(defaultoptioncreator.distancemeasureoption().cre ate()); addoption(defaultoptioncreator.numclustersoption().create( )); addoption(defaultoptioncreator.t1option().create()); addoption(defaultoptioncreator.t2option().create()); addoption(defaultoptioncreator.convergenceoption().create( )); addoption(defaultoptioncreator.maxiterationsoption().create ()); addoption(defaultoptioncreator.overwriteoption().create()); Map<String, String> argmap = parsearguments(args); if (argmap == null) { return -1; } Path input = getinputpath(); Path output = getoutputpath(); String measureclass = getoption(defaultoptioncreator.distance_measure_opt ION); if (measureclass == null) { measureclass = SquaredEuclideanDistanceMeasure.class.getName(); } double convergencedelta = Double.parseDouble(getOption(DefaultOptionCreator.CONVERGENCE_DELTA_OPTION)); int maxiterations = Integer.parseInt(getOption(DefaultOptionCreator.MA X_ITERATIONS_OPTION)); if (hasoption(defaultoptioncreator.overwrite_option) ) { HadoopUtil.delete(getConf(), output); } DistanceMeasure measure = ClassUtils.instantiateAs(measureClass, DistanceMeasure.class); if (hasoption(defaultoptioncreator.num_clusters_opt ION)) { int k = Integer.parseInt(getOption(DefaultOptionCreator.N UM_CLUSTERS_OPTION)); run(getconf(), input, output, measure, k, convergencedelta, maxiterations); } else { double t1 = Double.parseDouble(getOption(DefaultOptionCreator.T1_ OPTION)); double t2 = Double.parseDouble(getOption(DefaultOptionCreator.T2_ OPTION)); run(getconf(), input, output, measure, t1, t2, convergencedelta, maxiterations); } return 0; } Big Data Science 79
80 Setting K-Mean Algorithm public static void run(configuration conf, Path input, Path output, DistanceMeasure measure, int k, double convergencedelta, int maxiterations) throws Exception{ Path directorycontainingconvertedinput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT); log.info("preparing Input"); InputDriver.runJob(input, directorycontainingconvertedinput, "org.apache.mahout.math.randomaccesssparsevector"); log.info("running random seed to get initial clusters"); Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR); clusters = RandomSeedGenerator.buildRandom(conf, directorycontainingconvertedinput, clusters, k, measure); log.info("running KMeans"); KMeansDriver.run(conf, directorycontainingconvertedinput, clusters, output, measure, convergencedelta, maxiterations, true, false); // run ClusterDumper ClusterDumper clusterdumper = new ClusterDumper(finalClusterPath(conf, output, maxiterations), new Path(output, "clusteredpoints")); clusterdumper.printclusters(null); } Big Data Science 80
81 Example of Visualization of Clustering Big Data Science 81
82 Research Issues on Big Data Infra- Structure and Application Design of Efficient Map-Reduce for Important Algorithms: Counting(Word Count/TF-IDF)/Sort Several Data Mining Algorithms (Classification/Clustering/Association) Operations on Graphs/Networks Other Applications Application of Hadoop Map-Reduce on Several Domains Web/SNS data/bio/sensors/iot Many Domains: Triple Engine for WoD/ SA on Big Data Big Data Science 82
83 Situation Awareness on Big Data Big Data Science
84 Three layers for situation awareness Integration / Ontology Learning Data Mining / Machine Learning Reasoning / Prediction Inferred Metadata (Projection) Learned Relationship Metadata (Comprehension) Processed Information Entity Metadata (Perception) New Fact / Rules Semantics / Understanding / Insight Information Services for Data from Web/Smart Grid/Web/IOT/Cloud (World) Data Big Data Science
85 Layers for Situation Awareness on Smart Grid Services Projection Layer / Knowledge Extraction, Inference Comprehension Layer / Ontology Learning, Inference Perception Layer / Data Mining, Signal Processing Web Services / SOAP, RESTful Services Smart Grid Network / Sensors, Internet of Things Big Data Science
86 Active Situation Awareness Projection recommendtoparticipate TheEvent(Building, Event) needreplyto (ITM) checkhisevent (ITM) Comprehension (Situation) givehottopic (ITM,ATopicHisBlog) hasevent (Building, Event) israre(event) saycelebration (ITM, myblog) Perception Stand (People, Longline) isat (People, Building) Wrote (ITM, myblog) needreplyto (ITM) World Facebook Twitter Google Web Data Service 86 Big Data Science
87 A Framework for ASA on Big Data Big Data Infrastructure by Hadhoop and Data Mining Algorithm on it. Consistent Web APIs for ASA Upper Layer Big Data Infrastructure by Hadhoop - What is data format of SG? - How can get data from SG? - How Map between SG and Hive? - How can get data from Hive? - How map between Hive and Services? Smart Grid Infrastructure Big Data Science
88 Big Data Infrastructure To ASA Perception Layer Big Data Reasoner ETL Query ETL Map- Reduce Engine Events with Time and Location (ETL) Identification & Collector Big Data Science Big Data from SNS/Web/Sensors
89 Big Data Infrastructure <K,V> scheme for Event-Time-Location is S1: <E, <T,L>> S2: <T, <E,L>> S3: <L, <E,T>> where, E is event, T time and L location. Map-Reduce Function for Event, Time and Location Function: Retrieving E,T,L using Map-Reduce on Hadoop Input: SNS data set // set of words that are split Output: E/T/V related to Event //category of words set Map function for Event: for each word output <event@input, <T,L> pair> Reduce function for Event: for each event@input <T,L> pair set by temporal and spatial relationship reasoning output file reasoned <T,L> pair set - Map-Reduce function for the other element can be obtained the same fashion Big Data Science
90 Research Issues on Big Data Infra- Structure and Application Optimization of Big Data Infrastructure Locality/Network Aware Big Data Infrastructure Pipelining of Partitioning/Shuffling of Map- Reduce Considering Map-Oriented/Reduce-Oriented Operation Block & Job Scheduling for Optimization Big Data Science 90
91 Network-Aware Optimal Data Allocation for Map-Reduce Big Data Science
92 Why heavy network load? Data movement costs in map phrase: if the grouping data are distributed by Hadoop's random strategy, the shaded map tasks with either remote data access or queueing delay are the performance barriers; whereas if these data are evenly distributed, the MapReduce program can avoid these barriers. Intermediate data shuffling cost across racks in reduce phrase. TT j stands for a Task Tracker node j. P Ri:TTj stands for a partition P produced at TT j and hashed to reducer, Shuffling P R0:TT0 from TTO to TT3, P R1:TT2 from TT2 to TTO, and P R1:TT3 from TT3 to TTO, In addition, P R1:TT1 will be shuffled from TTl to TTO and P R0:TT3 will remain local to TT3. Network load: Off-Rack>rack-local>Node-local Big Data Science
93 Example of Minimizing Cost by Data Replacement Big Data Science
94 Research Problem The research problem is: To develop the mathematic models and theories; To develop learning algorithm, For the optimization of objective function defined in Eq. (1) with constrains (2) and (3). Big Data Science
95 Optimization Example It may show intuitively: Replicas of each data blocks are distributed into each main server node clusters. Data blocks whom share high similarity will be replaced closer. Big Data Science
96 References Tom White, Hadoop, OREILLLY, 2011 Srinath Perera, Thilina Gunarathne, Hadoop Map- Reduce Programming, Packt Publishing, 2013 J.H Jeong, Beginning Hadoop Programming: Development and Operations, Wiki Books, 2012 Big Data University, I. Paik, R. Sawa, K. Ofuji, and N. Yen, Introduction to Big Data Science, Graduate School Course, Big Data Science 96
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationExtreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationHadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
More informationWorking With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology
Working With Hadoop Now that we covered the basics of MapReduce, let s look at some Hadoop specifics. Mostly based on Tom White s book Hadoop: The Definitive Guide, 3 rd edition Note: We will use the new
More informationWord Count Code using MR2 Classes and API
EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationThe Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
More informationIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationBIG DATA APPLICATIONS
BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics
More informationHadoop Design and k-means Clustering
Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise
More informationHadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationOverview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
More informationData Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
More informationXiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
More informationIntroduc)on to Map- Reduce. Vincent Leroy
Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/
More informationHadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
More informationHadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationUniversity of Maryland. Tuesday, February 2, 2010
Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationProcessing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationLecture 3 Hadoop Technical Introduction CSE 490H
Lecture 3 Hadoop Technical Introduction CSE 490H Announcements My office hours: M 2:30 3:30 in CSE 212 Cluster is operational; instructions in assignment 1 heavily rewritten Eclipse plugin is deprecated
More informationPeers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationHadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN
Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal
More informationHadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology
Hadoop Dawid Weiss Institute of Computing Science Poznań University of Technology 2008 Hadoop Programming Summary About Config 1 Open Source Map-Reduce: Hadoop About Cluster Configuration 2 Programming
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationHadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationHow To Write A Map Reduce In Hadoop Hadooper 2.5.2.2 (Ahemos)
Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1 Content Map Function Reduce Function Why Hadoop HDFS Map Reduce Hadoop Some Questions 2 What is Map
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationZebra and MapReduce. Table of contents. 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples...
Table of contents 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples... 2 1. Overview MapReduce allows you to take full advantage of Zebra's capabilities.
More informationIntroduction to Big Data Science. Wuhui Chen
Introduction to Big Data Science Wuhui Chen What is Big data? Volume Variety Velocity Outline What are people doing with Big data? Classic examples Two basic technologies for Big data management: Data
More informationHadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationIstanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationBig Data Workshop. dattamsha.com
Big Data Workshop About Praveen Has more than15 years of experience working on various technologies. Is a Cloudera Certified Developer for Apache Hadoop CDH4 (CCD-410) with 95% score and got through the
More informationDeploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationData-intensive computing systems
Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationHadoop. History and Introduction. Explained By Vaibhav Agarwal
Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationHadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200
Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationPro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
More informationHadoop Basics with InfoSphere BigInsights
An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 2: Using MapReduce An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted Rights
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationGoogle Bing Daytona Microsoft Research
Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large
More informationCloud Computing Era. Trend Micro
Cloud Computing Era Trend Micro Three Major Trends to Chang the World Cloud Computing Big Data Mobile 什 麼 是 雲 端 運 算? 美 國 國 家 標 準 技 術 研 究 所 (NIST) 的 定 義 : Essential Characteristics Service Models Deployment
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationLambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce
More informationand HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
More informationHADOOP. Revised 10/19/2015
HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...
More informationHadoop: The Definitive Guide
FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationApache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past
More informationHDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 "Hadoop and HDInsight in a Heartbeat"
HDInsight Essentials Rajesh Nadipalli Chapter No. 1 "Hadoop and HDInsight in a Heartbeat" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationHadoop MapReduce: Review. Spring 2015, X. Zhang Fordham Univ.
Hadoop MapReduce: Review Spring 2015, X. Zhang Fordham Univ. Outline 1.Review of how map reduce works: the HDFS, Yarn sorting and shuffling advanced topics: partial sort, total sort, join, chained mapper/reducer,
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationWeekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
More informationt] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
More informationGetting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
More informationTutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
More informationHADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationHadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
More informationHadoop (Hands On) Irene Finocchi and Emanuele Fusco
Hadoop (Hands On) Irene Finocchi and Emanuele Fusco Big Data Computing March 23, 2015. Master s Degree in Computer Science Academic Year 2014-2015, spring semester I.Finocchi and E.Fusco Hadoop (Hands
More informationmap/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
More informationHadoop: Understanding the Big Data Processing Method
Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationBIG DATA HADOOP TRAINING
BIG DATA HADOOP TRAINING DURATION 40hrs AVAILABLE BATCHES WEEKDAYS (7.00AM TO 8.30AM) & WEEKENDS (10AM TO 1PM) MODE OF TRAINING AVAILABLE ONLINE INSTRUCTOR LED CLASSROOM TRAINING (MARATHAHALLI, BANGALORE)
More informationHadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
More informationHadoop Configuration and First Examples
Hadoop Configuration and First Examples Big Data 2015 Hadoop Configuration In the bash_profile export all needed environment variables Hadoop Configuration Allow remote login Hadoop Configuration Download
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More information