Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Big Data Science: Fundamental, Techniques, and Challenges (HDFS & Map-Reduce)

Transcription

1 Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Big Data Science: Fundamental, Techniques, and Challenges (HDFS & Map-Reduce) Incheon Paik University of Aizu, Japan Big Data Science 1

2 Contents What is Big Data? Big Data Source Clustered File System & Distributed File System Hadoop Distributed File System (HDFS) Map-Reduce Operation Application of Map-Reduce Big Data Science 2

3 Overview of Big Data Science What is Big Data? Recently, there have been very large and complex data sets from nature, sensors, social networks, enterprises increasingly based on high speed computers and networks together. Big data is the term for a collection of the data sets that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Big Data Science 3

4 Overview of Big Data Science Several Areas Meteorology Genomics Connectomics Complex physics simulations Biological and environmental research Internet search Finance and business informatics Data sets ubiquitous information-sensing mobile devices Aerial sensory technologies (remote sensing) Software logs Cameras Microphones Radio-frequency identification readers Wireless sensor networks Big Data Science 4

5 Clustered File System A file system which is shared by being simultaneously mounted on multiple servers. Clustered file systems can provide features like Location-independent addressing Redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance. Big Data Science 5

6 Clustered File System (DAS/NAS/SAN) Source: Big Data Science 6

7 Fundamental Differences in DAS/NAS/SAN Source: he.wikipedia.org Big Data Science 7

8 Storage Area Network Source: pccgroup.com Big Data Science 8

9 Shared-Disk / Storage Area Network A shared-disk file-system uses a storage-area network (SAN) to provide direct disk access from multiple computers at the block level. Access control and translation from file-level operations (application) to block-level operations (by SAN) must take place on the client node. Adds a mechanism for concurrency control which gives a consistent and serializable view of the file system, avoiding corruption and unintended data loss Usually employ some sort of a fencing mechanism to prevent data corruption in case of node failures. Big Data Science 9

10 Shared-Disk / Storage Area Network The SAN may use any of a number of block-level protocols, including SCSI, iscsi, HyperSCSI, ATA over Ethernet, Fibre Channel, etc. There are different architectural approaches to a shared-disk file-system. Distribute file information across all the servers in a cluster (fully distributed). Utilize a centralized metadata server. Both achieve the same result of enabling all servers to access all the data on a shared storage device. Big Data Science 10

11 Network Attached Storage (NAS) Source: Big Data Science 11

12 Network Attached Storage (NAS) Provides both storage and a file system, like a shared disk file system on top of a SAN. Typically uses file-based protocols (as opposed to block-based protocols a SAN would use) such as NFS, SMB/CIFS, AFP, or NCP. Design Considerations Avoiding single point of failure: Fault tolerance and high availability by data replication of one sort or another Performance: fast disk-access time and small amount of CPU-processing time over distributed structure Concurrency: for consistent and efficient multiple accesses to the same file or block by concurrency control or locking which may either be built into the file system or provided by an add-on protocol Big Data Science 12

13 NAS vs. SAN Source: turbotekcomputer.com Big Data Science 13

14 Distributed File System Distributed file systems do not share block level access to the same storage but use a network protocol. These are commonly known as network file systems, even though they are not the only file systems that use the network to send data. Distributed file systems can restrict access to the file system depending on access lists or capabilities on both the servers and the clients, depending on how the protocol is designed. Big Data Science 14

15 Distributed File System The difference between a distributed file system and a distributed data store Distributed file system (DFS) allows files to be accessed using the same interfaces and semantics as local files - e.g. mounting/unmounting, listing directories, read/write at byte boundaries, system's native permission model. Distributed data stores, by contrast, require using a different API or library and have different semantics (most often those of a database). Big Data Science 15

16 Distributed File System Design Goal of DFS Access transparency: Clients are unaware that files are distributed and can access them in the same way as local files are accessed. Location transparency: A consistent name space exists encompassing local as well as remote files. The name of a file does not give its location. Concurrency transparency: All clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner. Big Data Science 16

17 Distributed File System Design Goal of DFS Failure transparency: The client and client programs should operate correctly after a server failure. Heterogeneity: File service should be provided across different hardware and operating system platforms. Scalability: The file system should work well in small environments (1 machine, a dozen machines) and also scale gracefully to huge ones (hundreds through tens of thousands of systems). Replication transparency: To support scalability, we may wish to replicate files across multiple servers. Clients should be unaware of this. Migration transparency: Files should be able to move around without the client's knowledge. Big Data Science 17

18 Examples Distributed File System GFS (Google Inc.) Ceph (Inktank) Windows Distributed File System (DFS) (Microsoft) FhGFS (Fraunhofer) GlusterFS (Red Hat) Lustre Ibrix Big Data Science 18

19 Hadoop Distributed File System (HDFS) The existing file system such as large file storage, NAS or SAN is expensive. It needs very high performance server. HDFS can combine Web server level hosts into a disk storage. Some high performance large storage system will be needed. SAN is suitable for DBMS, NAS for safe file store. It can process big data concurrently using Map-Reduce. Four Fundamental Objectives of HDFS Error recovery Access data using streaming Large data storage Data integrity Big Data Science 19

20 Hadoop Distributed File System Big Data Science 20

21 Block Structured File System File Copy Structure on HDFS 320 MB File Block 1 Block 2 Block 3 Block 4 Block 5 HDFS Block 1 Block 3 Block 4 Block 2 Block 3 Block 4 Block 1 Block 3 Block 5 Block 2 Block 4 Block 5 Block 1 Block 2 Block 5 Big Data Science 21

22 Block Structured File System Source:bradhedlund.com Big Data Science 22

23 Block Structured File System Source:bradhedlund.com Big Data Science 23

24 NoSQL HDFS,NoSQL, Map-Reduce No SQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. NoSQL databases are often highly optimized key value stores intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and throughput. Big Data Science 24

25 Map Operation What is a Map operation Doing something to every element in an array is a common operation. The operation can be considered as Map function. int a = {1,2,3}; for (I = 0; I < a.length; i++) a[i] = a[i] * 2; This can be written as a function. New value for the variable a would be int a = {2,4,6}; Map-Reduce Operation Source:bigdatauniversity.com Big Data Science 25

26 Map Operation What is a Map operation Doing something to every element in an array is a common operation. The operation can be considered as Map function. int a = {1,2,3}; for (I = 0; I < a.length; i++) a[i] = f x (a[i]); All of this can also be converted into a map function. New value for the variable a would be int a = {2,4,6}; Like this, where f x is a function defined as : function f x (x) {return x * 2 ;} Big Data Science 26

27 Map Operation What is a Map operation like this, where f x is a function passed as an argument: function map (f x, a) { } for (I = 0; I < a.length; i++) a[i] = f x (a[i]); You can invoke this map function as follows map(function(x) {return x*2; }, a); This is function f x whose defintion is included in the call. Big Data Science 27

28 Reduce Operation What is a Reduce operation Another common operation on arrays is to combine all their value: function sum(a) { int s = 0; for (i = 0; I < a.length; i++) s += a[i] ; return s; } This can be written as a function. Big Data Science 28

29 Reduce Operation What is a Reduce operation Another common operation on arrays is to combine all their values: function sum(a) { int s = 0; for (i = 0; I < a.length; i++) s = f x (s, a[i]) ; return s; } Like this, where a function f x is defined so that it adds its arguments: function f x (a,b) {return a+b ;} The whole function sum can also be rewritten so that f x is passed as an argument. as a reduce operation: reduce(function(a,b) {return a+b; }, a, 0); Big Data Science 29

30 Submitting a MapReduce Job Source:bigdatauniversity.com Big Data Science 30

31 Shuffle Process 1. Split Creation 2. Map 3. Spill 4. Merge 5. Copy 6. Sort 7. Reduce Input Split Map1 Reduce1 Input Split Map2 Reduce2 Input Split Map3 Memory Buffer Partition Map Output Data Memory Buffer, File File Source: Beginning Hadoop Programming Big Data Science 31

32 MapReduce Type and Data Flow Lists Source:bigdatauniversity.com Big Data Science 32

33 HADOOP MapReduce Program : WordCount.java import java.io.ioexception; import java.util.stringtokenizer; import org.apache.hadoop.util.genericoptionsparser; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.longwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.mapper; import org.apache.hadoop.mapreduce.reducer; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.input.textinputformat; import org.apache.hadoop.mapreduce.lib.output.fileoutputformat; import org.apache.hadoop.mapreduce.lib.output.textoutputformat; public class WordCount { // Mapper Implementation public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new protected void map(longwritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.tostring(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasmoretokens()) { word.set(tokenizer.nexttoken()); context.write(word, one); } } } // Reducer Implementation public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable value = new protected void reduce(text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable value : values) sum += value.get(); value.set(sum); context.write(key, value); } } public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); // Parameter Passing GenericOptionsParser parser = new GenericOptionsParser(conf, args); args = parser.getremainingargs(); // Creating of Job Job job = new Job(conf, "wordcount"); job.setjarbyclass(wordcount.class); // Assign Classes for Mapper/Reducer job.setmapperclass(map.class); job.setreducerclass(reduce.class); // Other Setting for Job job.setmapoutputkeyclass(text.class); job.setmapoutputvalueclass(intwritable.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); Big Data Science 33

34 Map-Reduce Architecture on Hadoop Client Job Tracker Data Node Task Tracker Data Node Task Tracker Data Node Task Tracker Map Map Map Reduce Reduce Reduce Map Map Map Reduce Reduce Reduce Map Map Map Reduce Reduce Reduce Source: Beginning Hadoop Programming Big Data Science 34

35 Map-Reduce Architecture on Hadoop Client: Map-Reduce API provided by Hadoop and Map-Reduce Program executed by a user. Job Tracker A Map-Reducer program is managed as a work unit, Job. Manage scheduling and monitoring of entire jobs on Hadoop cluster. Running on a Name Node server usually, but necessarily. Description of Job Tracker When a user requests a new job, it calculates how many Maps and Reduces will be executed for the job. Big Data Science 35

36 Map-Reduce Architecture on Hadoop Description of Job Tracker A Map-Reduce program that executed by a client through Hadoop is managed by the unit of Job. The Job Tracker monitors scheduling of all jobs registered on Hadoop cluster. But we don t have to execute the job tracker on name node server. Decide Task Trackers where the Maps-Reduces will be executed and assign the job to the Task Trackers decided. -> Then the task trackers execute the Map- Reduce programs. The job tracker communicates with task trackers with heart-beats for checking status of task trackers and job execution information. Big Data Science 36

37 Map-Reduce Architecture on Hadoop Task Tracker A demon program that executes Map-Reduce program of a user on a Data Node. Creates Map-Tasks and Reduce-Tasks that the Job Tracker requested. Execute the tasks (Map and Reduces) by running new JVM. Workflow of Map-Reduce A user executes a job by invoking waitforcompletion method. A job client object is created in the job interface, and the object access to the Job Tracker to execute the job. Big Data Science 37

38 Name Node Map-Reduce Operation on Hadoop Data Node Client 1. Map-Reduce Job Execute Job Client Input Data 2. New Job Submit 3. Input Split Creation Job Tracker 4. Job Assignment Task Tracker Mapper Mapper 5. Mapper Execution Task Task Job Task Task Task Partitioner Key-Value Key-Value 6. Sort and Merge Reducer Reducer 7. Reducer Execution Key-Value Key-Value Output Data 8. Output Data Save Source: Beginning Hadoop Programming Big Data Science 38

39 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce Then, the Job Tracker returns a new job ID to the Job Client. And the Job Client checks the output path defined by the user. The Job Client calculates input split (Size of input file to be processed in a Map procedure) about the input data of the job. After the calculation of the input split, it saves input split information, setting file, and Map- Reduce JAR file to HDFS, and notifies completion of preparation for starting the job. The Job Tracker registers the job on the Queue, Job scheduler initializes the job from the Queue. And then, the Job Scheduler creates Map tasks according to the input split information and assign IDs. Big Data Science 39

40 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce The Task Tracker gives its status to the Job Tracker periodically calling heartbeat method. In heartbeat information, there are resource information about CPU, memory, disk, number of tasks, and available maximum tasks, new tasks execution possibilities. The Task Tracker executes the Map task assigned by the Job Tracker. The Map task executes the logic in the Map method, and save output data to memory buffer. At this time, a patitioner decides a Reduce task to which the output data should be transferred, and assign a suitable partition for it. The data in memory will be sorted by the keys, and saved to local hard disk. When the save is to be completed, the files will be sorted and merged into one file. Big Data Science 40

41 Map-Reduce Architecture on Hadoop Workflow of Map-Reduce A reduce task is the Reducer class written by a user. The task can be executed when all output data are to be prepared. When the Map task completes output, it notifies its completion to the Task Tracker that ran itself. When the Task Tracker receives the message, it notifies status of the corresponding Map task and output data path of the Map task to the Job Tracker. The Reduce task copies output data of all Map tasks, and merges output data of the Map tasks. When the merge is to be completed, it executes analysis logic to call Reduce method. The Reduce task saves the output data to HDFS as name of part-nnnnn. Where nnnnn means partition ID. Big Data Science 41

42 Component of Map-Reduce Programming Data Types The Map-Reduce is the optimized object for network communication, and provides WritableComparable interface. All data types for Key and Value in Map-Reduce program should implement the WritableComparable interface. The WritableComparable interface inherits the Writable and Comparable interfaces. The write method serializes data value, and the readfields method unserializes to read the serialized data. public interface Writable { void write (DataOutput out) throws IOException; void readfields (DataInput in) throws IOException; } Big Data Science 42

43 Component of Map-Reduce Programming Data Types The Wrapper class to implement the WritableComparable interface for each data type: BoolleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, TextWritable, NullWritable. InputFormat The InputFormat abstract class let us use it as input parameter of the map method. public abstract class InputFormat(K, V> { public abstract List<InputSplit> getsplits(jobcontext context) throws IOException, InterruptedException; public abstract RecordReader<K, V> createrecordreader(inputsplit split, TaskAttemptContext context) throws IOException, InterruptedException; } Big Data Science 43

44 Component of Map-Reduce Programming InputFormat getsplits : the map can use input split. createrecordreader : creates RecordReader object so that the map method uses input split in the form of key and list. The map method executes analysis logic to read key and value in the RecordReader object. Kinds of the InputFormat: TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, DelegatingInputFormat, CombineFileInputFormat, SequenceFileInputFormat, SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat Big Data Science 44

45 Component of Map-Reduce Programming Mapper Class Carry out function of map method of Map-Reduce programming. It receives input data consisting key and value, then process and classify the data to create new data list. Class definition: using <Input Key Type, Input Value Type, Output Key Type, Output Value Type> Context object: Getting information about job, and read input split as unit of record. The RecordReader object is passed to the map method to read data in the form of key and value. A Map programmer overwrites the map method. Next, when the run method is to be executed, the map method will be executed for all keys in the Context object. Big Data Science 45

46 Component of Map-Reduce Programming Mapper Class public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { public class Context extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { public Context(Configuration conf, TaskAttemptID taskid, RecordReader<KEYIN,VALUEIN> reader, RecordWriter<KEYOUT,VALUEOUT> writer, OutputCommitter committer, StatusReporter reporter, InputSplit split) throws IOException, InterruptedException { super(conf, taskid, reader, writer, committer, reporter, split); } protected void map(keyin key, VALUEIN value, Context context) throws IOException, InterruptedException { context.write((keyout) key, (VALUEOUT) value); } Public void run(context context) throws IOException, InterruptedException { setup (context); while (context.nextkeyvalue()) { map(context.getcurrentkey(), context.getcurrentvalue(), context); } } Big Data Science 46

47 Component of Map-Reduce Programming Partitioner To decide to which Reduce task the data of Map task will be passed. Calculating a partition as key.hashcode() & Integer.MAX_VALUE) % numreducetasks According to the result, the partition will be created at the node that Map task has been executed, and output data of the Map task is saved. When the Map task is completed, the data at the partition will be transmitted to corresponding Reduce task through network. The getpartition method can be overwritten for other partitioning strategy. public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> { public int getpartition(k2 key, V2 value, int numreducetasks) { return (key.hashcode() & Integer.MAX_VALUE) % numreducetasks; } } Big Data Science 47

48 Component of Map-Reduce Programming Reducer The Reducer class receives the output data of the Map task as input data to execute aggregation operation. Class definition: using <Input Key Type, Input Value Type, Output Key Type, Output Value Type> Like Mapper class, the Context object: inherits the ReduceContext. Consult with job for information, getting the information as RawKeyValueIterator for checking input value list. RecordWriter as parameter so that output result of map method can be as unit of record. A Reducer programmer overwrites the reduce method. Big Data Science 48

49 Component of Map-Reduce Programming Reducer Class public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { public class extends ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { public Context(Configuration conf, TaskAttemptID taskid, RawKeyValueIterator input, Counter inputkeycounter, Counter inputvaluecounter, RecordWriter<KEYOUT,VALUEOUT> output, OutputCommitter committer, StatusReporter reporter, RawComparator<KEYIN> comparator, Class<KEYIN> keyclass, Class<VALUEIN> valueclass ) throws IOException, InterruptedException { super(conf, taskid, input, inputkeycounter, inputvaluecounter, output, committer, reporter, comparator, keyclass, valueclass); } } protected void reduce(keyin key, Iterable<VALUEIN> values, Context context ) throws IOException, InterruptedException { for(valuein value: values) { context.write((keyout) key, (VALUEOUT) value); } } public void run(context context) throws IOException, InterruptedException { setup(context); while (context.nextkey()) { reduce(context.getcurrentkey(), context.getvalues(), context); } cleanup(context); } } // end of class Reducer Big Data Science 49

50 Component of Map-Reduce Programming Combiner Class Shuffle: Transportation process between Map task and Reduce task. It needs to reduce the amount of data that will be transferred through network to improve performance of entire jobs. The Combiner class inputs the output data of Mapper and creates reduced data on the local node to send through network. The Reducer should produce the same output in the both case: with the Combiner and without the class. Big Data Science 50

51 Component of Map-Reduce Programming OutputFormat Output data format of Map-Reduce is created by the format of setoutputformatclass method in Job interface. Output data format is made by inheriting the abstract class OutputFormat. Kinds of the OutputFormat: TextOutputFormat, SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, FilterOutputFormat, LazyOutputFormat, NullOutputFormat. The classes inherit the FileOutputFormat class which inherits the OutputFormat class. Big Data Science 51

52 Document Frequency But document frequency (df) may be better: df = number of docs in a corpus containg the term Word cf df pandemic influenza Document/Collection frequency weighting is only possible in known(static) collection. So how do we make use of df? Big Data Science 52

53 TF IDF term weights TF IDF measure combines: Term Frequency (TF) or wf, some measure of term density in a doc Inverse document frequency (IDF) measure of informativeness of a term: its rarity across the whole corpus Could just be raw count of number of documents the term occurs in (IDF i = 1/DF i ) but by far the most commonly used version is: IDF i = log (n/df i ) Big Data Science 53

54 Summary : TF IDF (or tf.idf) Assign a tf.idf weight to each term i in each document d Increases with the number of occurrences within a doc Increases with the rarity of the term across the whole corpus Big Data Science 54

55 HADOOP MapReduce Program : Calculating TF-IDF Function: Calculating TF-IDF using Map-Reduce on Hadoop Input: word in document // set of words that are split Output: TF-IDF for each word //category of words set 1st Map function: for each word output <word@document title,1> 1st Reduce function: for each word@document n = number of each word@document file output file word@documet title, n 2nd Map function for each word@document title output <document title,word=n> 2nd Reduce function for each document N = number of word in all documents for each word output word@document title, n/n 3rd Map function for each word@document output <word,document title=n/n> 3rd Reduce function for each word D = number of word in all documents for each document d = number of word in document calculate TF-IDF value use n, N, d and D output word@documet,tf-idf value } Big Data Science 55

56 Hadoop Eco System Source: Big Data Science 56

57 Hadoop Eco System The Hadoop HDFS and Map-Reduce function are useful for Big data processing, but just for specific Java developers and providing low level processing. The system makes Hadoop more usable and layman-accessible with user friendly manner. It does not mean to all be used together, but as parts of a single organism; some may even be seeking to solve the same problem in different ways. Big Data Science 57

58 YARN Yet Another Resource Negotiator (YARN) addresses problems with MapReduce 1.0 s architecture, specifically with the JobTracker service. YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and perapplication ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster. Big Data Science 58

59 Avro Hadoop Eco System A framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. BigTop A project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but has since become its own project at Apache. Chukwa A data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop. Big Data Science 59

60 Drill Hadoop Eco System A distributed system for executing interactive analysis over largescale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more. Flume A tool for harvesting, aggregating and moving large amounts of log data in and out of Hadoop. HBase Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. Big Data Science 60

61 Hadoop Eco System HCatalog A metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services such as MapReduce and Pig with plans to expand to HBase using a common data model. Hive Provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Oozie A job coordinator and workflow manager for jobs executed in Hadoop, which can include non-mapreduce jobs. Big Data Science 61

62 Hadoop Eco System Mahout Pig Mahout is a scalable machine-learning and data mining library. There are currently four main groups of algorithms in Mahout: recommendations, a.k.a. collective filtering classification, a.k.a categorization clustering frequent itemset mining, a.k.a parallel frequent pattern mining Framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce. Big Data Science 62

63 Sqoop Hadoop Eco System Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase. ZooKeeper A service for maintaining configuration information, naming, providing distributed synchronization and providing group services. Spark (UC Berkeley) A parallel computing program which can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project at the U.C. Berkeley AMPLab, and in its own words, Spark "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining." Big Data Science 63

64 Apache Pig High-Level Data Flow Language Made of two components: Data processing language Pig Latin Complier to translate Pig Latin to MapReduce It abstracts a programmer from specific details and allows to focus on data processing Big Data Science 64

65 Pig in the Hadoop Ecosystem Source: sigmanac.com Big Data Science 65

66 Pig Latin Pig Latin users = LOAD users.text USING PigStorage (, ) AS (name, age); pages = LOAD pages.txt USING PigStorage (, ) AS (user, url); filteredusers = FILTER users BY age >= 18 and age <= 50; joinresult = JOIN filteredusers BY name, ages by user; grouped = GROUP joinresult by url; summed = FOREACH grouped GENERATE group, COUNT(joinResult) as clicks; sorted = ORDER summed BY clicks desc; top10 = LIMIT sorted 10; STORE top10 INTO top10sites ; Big Data Science 66

67 Apache Hive: SQL for Hadoop Data Warehousing Layer on top of Hadoop Allows Analysis and queries using a SQL-like language It is the best for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis Source: sigmanac.com Big Data Science 67

68 Hive Architecture Big Data Science 68

69 Example Hive Example CREATE TABLE users (name STRING, age INT); CREATE TABLE pages (user STRING, url STRING); LOAD DATA INPATH /user/sandbox/users.txt INTO TABLE users ; LOAD DATA INPATH /user/sandbox/pages.txt INTO TABLE pages ; SELECT pages.url, count(*) AS clicks FROM users JOIN pages ON (users.name = pages.user) WHERE users.age >= 18 AND users.age <= 50 GROUP BY pages.url SORT By clicks DESC LIMIT 10; Big Data Science 69

70 Mahout in Hadoop Data Warehousing Layer on top of Hadoop Allows Analysis and queries using a SQL-like language It is the best for data analysts familiar with SQL who need to do ad-hoc queries, summarization and data analysis Copy source: Big Data Science Copyright: NTT Data Co. 70

71 Mahout Package What does Mahout provide: Several Data Mining Algorithms Classification Clustering Association Analysis Recommendations Others 71

72 Classification Assigning some objects to one of several predefined categories: Politics, Economics, Sports of Reuter news paper Some figures of galaxies Classifying credit card transactions as legitimate or fraudulent Classify malicious code or ot Algorithms Naïve Bayes Decision Tree SVM Linear Regression Neural Network Rule-based Methods 72

73 Clustering Grouping objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Partitional Clustering / Hierarchical Clustering K-Means, Fuzzy K-Means, Density- Based, Different distance measures Manhattan, Euclidean, Other domain dependent methods 73

74 Association Analysis Find the frequent item sets in transaction DB <milk, bread, cheese> are sold frequently together Apriori principle Market analysis, access pattern analysis, etc TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke 74

75 Recommendation Two Representative Types of Recommendation Contents Based Recommendation Collaborative Filtering (By rating of similar users) Trust-Based Recommendation Online and Offline Support Several Similarity Measures Cosine, LLR, Pearson Correlation 75

76 Others Outlier detection Math library Vectors, matrices, etc. Noise reduction 76

77 Mahout Example: K-Mean Clustering Big data application supports: Ma-Reduce programming, Pig, Hive, etc Some algorithms such as collaboration filtering, clustering, classification, association is difficult to implement. Example of K-Mean Clustering Big Data Science 77

78 Mahout Example: K-Mean Clustering Short Introduction to Installation and Clustering of Human Development Report Data Set Mahout Source Download: Extracting files and set MAHOUT_HOME variable. Getting data set: ol/synthetic_control. Execution of K-Mean example bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.job Big Data Science 78

79 Initializing K-Mean Algorithm public static void main(string[] args) throws Exception { Path output = new Path("output"); Configuration conf = new Configuration(); HadoopUtil.delete(conf, output); run(conf, new Path("testdata"), output, new EuclideanDistanceMeasure(), 6, 0.5, 10); public int run(string[] args) throws Exception{ addinputoption(); addoutputoption(); addoption(defaultoptioncreator.distancemeasureoption().cre ate()); addoption(defaultoptioncreator.numclustersoption().create( )); addoption(defaultoptioncreator.t1option().create()); addoption(defaultoptioncreator.t2option().create()); addoption(defaultoptioncreator.convergenceoption().create( )); addoption(defaultoptioncreator.maxiterationsoption().create ()); addoption(defaultoptioncreator.overwriteoption().create()); Map<String, String> argmap = parsearguments(args); if (argmap == null) { return -1; } Path input = getinputpath(); Path output = getoutputpath(); String measureclass = getoption(defaultoptioncreator.distance_measure_opt ION); if (measureclass == null) { measureclass = SquaredEuclideanDistanceMeasure.class.getName(); } double convergencedelta = Double.parseDouble(getOption(DefaultOptionCreator.CONVERGENCE_DELTA_OPTION)); int maxiterations = Integer.parseInt(getOption(DefaultOptionCreator.MA X_ITERATIONS_OPTION)); if (hasoption(defaultoptioncreator.overwrite_option) ) { HadoopUtil.delete(getConf(), output); } DistanceMeasure measure = ClassUtils.instantiateAs(measureClass, DistanceMeasure.class); if (hasoption(defaultoptioncreator.num_clusters_opt ION)) { int k = Integer.parseInt(getOption(DefaultOptionCreator.N UM_CLUSTERS_OPTION)); run(getconf(), input, output, measure, k, convergencedelta, maxiterations); } else { double t1 = Double.parseDouble(getOption(DefaultOptionCreator.T1_ OPTION)); double t2 = Double.parseDouble(getOption(DefaultOptionCreator.T2_ OPTION)); run(getconf(), input, output, measure, t1, t2, convergencedelta, maxiterations); } return 0; } Big Data Science 79

80 Setting K-Mean Algorithm public static void run(configuration conf, Path input, Path output, DistanceMeasure measure, int k, double convergencedelta, int maxiterations) throws Exception{ Path directorycontainingconvertedinput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT); log.info("preparing Input"); InputDriver.runJob(input, directorycontainingconvertedinput, "org.apache.mahout.math.randomaccesssparsevector"); log.info("running random seed to get initial clusters"); Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR); clusters = RandomSeedGenerator.buildRandom(conf, directorycontainingconvertedinput, clusters, k, measure); log.info("running KMeans"); KMeansDriver.run(conf, directorycontainingconvertedinput, clusters, output, measure, convergencedelta, maxiterations, true, false); // run ClusterDumper ClusterDumper clusterdumper = new ClusterDumper(finalClusterPath(conf, output, maxiterations), new Path(output, "clusteredpoints")); clusterdumper.printclusters(null); } Big Data Science 80

81 Example of Visualization of Clustering Big Data Science 81

82 Research Issues on Big Data Infra- Structure and Application Design of Efficient Map-Reduce for Important Algorithms: Counting(Word Count/TF-IDF)/Sort Several Data Mining Algorithms (Classification/Clustering/Association) Operations on Graphs/Networks Other Applications Application of Hadoop Map-Reduce on Several Domains Web/SNS data/bio/sensors/iot Many Domains: Triple Engine for WoD/ SA on Big Data Big Data Science 82

83 Situation Awareness on Big Data Big Data Science

84 Three layers for situation awareness Integration / Ontology Learning Data Mining / Machine Learning Reasoning / Prediction Inferred Metadata (Projection) Learned Relationship Metadata (Comprehension) Processed Information Entity Metadata (Perception) New Fact / Rules Semantics / Understanding / Insight Information Services for Data from Web/Smart Grid/Web/IOT/Cloud (World) Data Big Data Science

85 Layers for Situation Awareness on Smart Grid Services Projection Layer / Knowledge Extraction, Inference Comprehension Layer / Ontology Learning, Inference Perception Layer / Data Mining, Signal Processing Web Services / SOAP, RESTful Services Smart Grid Network / Sensors, Internet of Things Big Data Science

86 Active Situation Awareness Projection recommendtoparticipate TheEvent(Building, Event) needreplyto (ITM) checkhisevent (ITM) Comprehension (Situation) givehottopic (ITM,ATopicHisBlog) hasevent (Building, Event) israre(event) saycelebration (ITM, myblog) Perception Stand (People, Longline) isat (People, Building) Wrote (ITM, myblog) needreplyto (ITM) World Facebook Twitter Google Web Data Service 86 Big Data Science

87 A Framework for ASA on Big Data Big Data Infrastructure by Hadhoop and Data Mining Algorithm on it. Consistent Web APIs for ASA Upper Layer Big Data Infrastructure by Hadhoop - What is data format of SG? - How can get data from SG? - How Map between SG and Hive? - How can get data from Hive? - How map between Hive and Services? Smart Grid Infrastructure Big Data Science

88 Big Data Infrastructure To ASA Perception Layer Big Data Reasoner ETL Query ETL Map- Reduce Engine Events with Time and Location (ETL) Identification & Collector Big Data Science Big Data from SNS/Web/Sensors

89 Big Data Infrastructure <K,V> scheme for Event-Time-Location is S1: <E, <T,L>> S2: <T, <E,L>> S3: <L, <E,T>> where, E is event, T time and L location. Map-Reduce Function for Event, Time and Location Function: Retrieving E,T,L using Map-Reduce on Hadoop Input: SNS data set // set of words that are split Output: E/T/V related to Event //category of words set Map function for Event: for each word output <event@input, <T,L> pair> Reduce function for Event: for each event@input <T,L> pair set by temporal and spatial relationship reasoning output file reasoned <T,L> pair set - Map-Reduce function for the other element can be obtained the same fashion Big Data Science

90 Research Issues on Big Data Infra- Structure and Application Optimization of Big Data Infrastructure Locality/Network Aware Big Data Infrastructure Pipelining of Partitioning/Shuffling of Map- Reduce Considering Map-Oriented/Reduce-Oriented Operation Block & Job Scheduling for Optimization Big Data Science 90

91 Network-Aware Optimal Data Allocation for Map-Reduce Big Data Science

92 Why heavy network load? Data movement costs in map phrase: if the grouping data are distributed by Hadoop's random strategy, the shaded map tasks with either remote data access or queueing delay are the performance barriers; whereas if these data are evenly distributed, the MapReduce program can avoid these barriers. Intermediate data shuffling cost across racks in reduce phrase. TT j stands for a Task Tracker node j. P Ri:TTj stands for a partition P produced at TT j and hashed to reducer, Shuffling P R0:TT0 from TTO to TT3, P R1:TT2 from TT2 to TTO, and P R1:TT3 from TT3 to TTO, In addition, P R1:TT1 will be shuffled from TTl to TTO and P R0:TT3 will remain local to TT3. Network load: Off-Rack>rack-local>Node-local Big Data Science

93 Example of Minimizing Cost by Data Replacement Big Data Science

94 Research Problem The research problem is: To develop the mathematic models and theories; To develop learning algorithm, For the optimization of objective function defined in Eq. (1) with constrains (2) and (3). Big Data Science

95 Optimization Example It may show intuitively: Replicas of each data blocks are distributed into each main server node clusters. Data blocks whom share high similarity will be replaced closer. Big Data Science

96 References Tom White, Hadoop, OREILLLY, 2011 Srinath Perera, Thilina Gunarathne, Hadoop Map- Reduce Programming, Packt Publishing, 2013 J.H Jeong, Beginning Hadoop Programming: Development and Operations, Wiki Books, 2012 Big Data University, I. Paik, R. Sawa, K. Ofuji, and N. Yen, Introduction to Big Data Science, Graduate School Course, Big Data Science 96