I 3 Map Reduce for mining Big Data

Transcription

1 I 3 Map Reduce for mining Big Data P M Anupraba 1 S Parameswaran 2 1 (ME, Dept of CSE, Sree Sowdambika College Of Engg, Aruppukottai, India, [email protected]) 2 (Asst Prof, Dept of CSE, Sree Sowdambika College Of Engg, Aruppukottai, India, [email protected]) Abstract With the development of computer technology, there is a tremendous increase in the growth of data. A big problem has been encountered in various fields for making the full use of these large scale data which supports decision making. Data mining is the technique that can discovers new patterns from large data sets, where most classical data mining methods became out of reach in practice to handle such big data. Big data is large volume, heterogeneous (STRUCTURED, SEMI STRUCTURED & UN STRUCTURED) distributed data, which can be analyzed by MapReduce, a substructure of HDFS Platform. In this paper, a new model is proposed as i3 MapReduce, which acts as an efficient approach for refreshing materialized data by reducing re-computation process. It is based on a new Clustering algorithm with its features like,(i) Centralized- Environment, (ii) Membership id for each data,(iii) Data point in multiple cluster (iv) Easy handling of overlapped data-set. I assessed the final results according to the Online Experiment, using Amazon s Public Cloud and performance improvement Analysis tabulated between i3-mapreduce and other MapReduce. Keywords MapReduce; Fuzzy Clustering; Big Data; Public Cloud 1. INTRODUCTION Big Data became a big topic across nearly every area of IT. IDC defines Big Data technologies as a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics. Then there are the products and services that can be wrapped around one or all of these Big Data elements. The digital universe itself, of course, comprises data all kinds of data. However, the vast majority of new data being generated is unstructured. This means that more often than not, we know little about the data, unless it is somehow characterized or tagged a practice that results in metadata. Metadata is one of the fastest-growing sub segments of the digital universe (though metadata itself is a small part of the digital universe overall). We believe that by 2020, a third of the data in the digital universe (more than 13,000 hexa bytes) will have Big Data value, but only if it is tagged and analysed. Big data and analytics offer the promise to transform risk management and decisionmaking, providing more information and more speed. They won t solve every problem, however, and with these new sources of information come new pressures to focus risk management activities and respond quickly to perceived dangers. Using big data, companies have the potential to better identify hidden risk and allow better root cause analysis. Risk managers can improve their ability to determine the probability of an event by leveraging metadata and using customer segmentation to identify risk factors. Big data can help develop better early warning indicators that will allow companies to mitigate risk more effectively. 2. EASE OF USE A Map Reduce program is composed of a Map function and a Reduce function[1], as shown in Fig.1.TheMap function takes a kv-pair<k1;v1>as input and computes zero or more intermediate kvpairs<k2;v2>.thenall<k2;v2>aregroupedbyk2.the Reduce function takes a K2 and a list of {V2} as input and computes the final output kv-pairs<k3;v3>. A MapReduce system(e.g.,apachehadoop) usually reads the input data of the MapReduce computation from and writes the final results to a distributed file system (e.g., HDFS), which divides a file into equal-sized (e.g.,64mb)blocks and stores the blocks across a cluster of machines. For a Map Reduce program, the MapReduce system runs a Job Tracker process on a master node to monitor the job progress, and a set of Task Tracker processes on worker nodesnto perform the actual Map and Reduce tasks. The Job Tracker starts a Map task per data block, and typically assigns it to the Task Tracker on the machine that holds the corresponding data block inorder to minimize communication overhead. Each Maptask calls the Map function for every input, and stores the intermediate kvpairson local disks. Intermediate results are shuffled to Reduce tasks according to a partition function(e.g., a hash function)on K2. After a Reduce task obtains and merges intermediate results from all Map Tasks and produces the final output results. Fig 1. MapReduce Implementation of KV-Pairs Volume: 02 Issue:

2 3. APACHE HIVE Hive is a component of Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to data stored in HDP. The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data. and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. 3.1 HBase Apache HBase is the Hadoop database modeled after Google s BigTable. It is a sparse, distributed, persistent multi- dimensional sorted map. The map is indexed by a row key, column key, and a timestamp. Each value in the map is an un-interpreted array of bytes. Logical View 3.2 Hive + HBase Integration For Input/Output Format, getsplits(), etc underlying HBase classes are used. Column selection and certain filters can be pushed down. HBase tables can be used with other(hadoop native) tables and SQL constructs. Hive DDL operations are converted to HBase DDL operations via the client hook. All operations are performed by the client No two phase commit. 3.3 Schema Mapping Hive table + columns + column types <=>HBase table + column families (+ column qualifiers). Every field in Hive table is mapped in order to either The table key (using :key as selector) A column family (cf:) -> MAP fields in Hive A column (cf:cq) Hive table does not need to include all columns in HBase. 3.4 Type Mapping Recently added to Hive Previously all types were being converted to strings in HBase. Hive has: Primitive types: INT, STRING, BINARY, DATE, etc ARRAY MAP STRUCT HBase does not have types Bytes.toBytes() Table level property "hbase.table.default.storage.type = binary. Type mapping can be given per column after # 3.5 Bulk Load: Steps to bulk load: Sample source data for range partitioning Save sampling results to a file Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner Import Hfiles into HBase table Ideal setup should be SET hive.hbase.bulk=true INSERT OVERWRITE TABLE web_table SELECT. 3.6 Filter Pushdown Idea is to pass down filter expressions to the storage layer to minimize scanned data. To access indexes at HDFS or HBase. Example: CREATE EXTERNAL TABLE users (userid LONG, STRING, ) STORED BY 'org.apache.hadoop.hive.hbase.hbasestoragehandler WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, ") SELECT... FROM users WHERE userid> and LIKE %@gmail.com ; - >scan.setstartrow(bytes.tobytes( )) 3.7 Filter Decomposition Optimizer pushes down the predicates to the query plan. Storage handlers can negotiate with the Hive optimizer to decompose the filter. Apache Hive Architecture Volume: 02 Issue:

3 x> 3 AND upper(y) = 'XYZ Handle x > 3, send upper(y) = XYZ as residual for Hive. Works with:key = 3, key > 3, etc key > 3 AND key < 100.Only works against constant expressions. 4. ITERATIVE COMPUTATION Iteration in computing is the repetition of a block of statements within a computer program. It can be used both as a general term, synonymous with repetition, and to describe a specific form of repetition with a mutable state. Confusingly, it may also refer to any repetition stated using an explicit repetition structure, regardless of mutability. It is applied using some algorithms like PageRank, GIM- V,Clustering etc. 4.1 PageRank: PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.it computes a ranking score for each vertex in a graph. After initializing all ranking scores, the computation performs a MapReduce job per iteration, as shown in Algorithm 2. i and j are vertex ids, Ni is the set of out-neighbor vertices of i, Ri is i s ranking score that is updated iteratively. j means concatenation. All Ri s are initialized to one.2 The Map instance on vertex i sends value Ri;j ¼ Ri=jNij to all its out-neighbors j, where jnij is the number of i s outneighbors. The Reduce instance on vertex j updates Rj by summing the Ri;j received from all its in-neighborsi, and applying a damping factor d. Algorithm 1. PageRank in MapReduce 4.2 GIM-V: Generalized Iterated Matrix-Vector multiplication (GIM-V) is an abstraction of many iterative graph mining operations (e.g., PageRank, spectral clustering, diameter estimation, connected components). These graph mining algorithms can be generally represented by operating on an nn matrix M and a vector v of size n. Suppose both the matrix and the vector are divided into sub-blocks. Let mi;j denote the ði;jþth block of M and vj denote the jth block of v. The computation steps are similar to those of the matrixvector multiplication and can be abstracted into three operations: (1) mvi;j = combine2(mi;j;v j); (2) v0 i = combinealli({mvi;j}); and (3) vi = assign(vi;v 0 i). We can compare combine2 to the multiplication between mi;j and vj, and compare combineall to the sum of mvi;j for row i. Algorithm 4 shows the MapReduce implementation with two jobs for each iteration. The first job assigns vector block vj to multiple matrix blocks mi;j (8i) and performs combine2(mi;j;v j) to obtain mvi;j. The second job groups the mvi;j and vi on the same i, performs the combineall ({mvi;j}) operation, and updates vi using assign(vi;v 0 i). Algorithm 2.GIM-V in MapReduce 4.3 Fuzzy C-Means Clustering : Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. Fixed number of clusters. Each data point belongs to the cluster corresponding to the closest centroid. Clusters are fuzzy sets. Membership degree of a point can be any number between 0 and 1.Sum of all degrees for a point must add up to Reasoning with Fuzzy sets: There are two assumptions that are essential for the use of formal set theory: For any element and a set belonging to some universe, the element is either a member of the set or else it is a member of the complement of that set. An element cannot belong to both a set and also to its complement. Clustering involves the task of dividing data points into homogeneous classes or clusters so that items in the same class are as similar as possible and items in different classes are as dissimilar as possible. Clustering can also be thought of as a form of data compression, where a large number of samples are converted into a small number of representative prototypes or clusters. Depending on the data and the application, different types of similarity measures may be used to identify classes, where the similarity measure controls how the clusters are formed. Some examples of values that can be used as similarity measures include distance, connectivity, and intensity.in Volume: 02 Issue:

4 non-fuzzy or hard clustering, data is divided into crisp clusters, where each data point belongs to exactly one cluster. Membership Matrix M This iteration will stop when, where is a termination criterion between 0 and 1, whereas k are the iteration steps. This procedure converges to a local minimum or a saddle point ofj m. The algorithm is composed of the following steps: Algorithm: FCM-Algorithm In fuzzy clustering, the data points can belong to more than one cluster, and associated with each of the points are membership grades which indicate the degree to which the data points belong to the different clusters. This chapter demonstrates the fuzzy c-means clustering algorithm.the FCM algorithm attempts to partition a finite collection of elements X={,,..., } into a collection of c fuzzy clusters with respect to some given criterion. Given a finite set of data, the algorithm returns a list of c cluster centers V, such that V =, i =1, 2,..., c and a partition matrix U such that U =, i =1,..., c, j =1,..., n, where is a numerical value in[0, 1] that tells the degree to which the element belongs to the i-th cluster. Fuzzy Clustering Looking at the picture, we may identify two clusters in proximity of the two data concentrations. We will refer to them using A and B. In the first approach shown in this tutorial - the k-means algorithm - we associated each datum to a specific centroid; therefore, this membership function looked like this: Objective function: with the update of membership u ij and the cluster centers c j by:, In the FCM approach, instead, the same given datum does not belong exclusively to a well defined cluster, but it can be placed in a middle way. In this case, the membership function follows a smoother line to indicate that every datum may belong to several clusters with different values of the membership coefficient., Volume: 02 Issue:

5 In the figure above, the datum shown as a red marked spot belongs more to the B cluster rather than the A cluster. The value 0.2 of m indicates the degree of membership to A for such datum. Now, instead of using a graphical representation, we introduce a matrix U whose factors are the ones taken from the membership functions: (a) (b) The number of rows and columns depends on how many data and clusters we are considering. More exactly we have C = 2 columns (C = 2 clusters) and N rows, where C is the total number of clusters and N is the total number of data. The generic element is so indicated: u ij. In the examples above we have considered the k-means (a) and FCM (b) cases. We can notice that in the first case (a) the coefficients are always unitary. It is so to indicate the fact that each datum can belong only to one cluster. Other properties are shown below: 5. CONFIGURATION Install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that. 5.1 Requirements Java 1.7 Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE- 8607). Hadoop 2.x (preferred), 1.x (not supported by Hive onward). Hive versions up to 0.13 also supported Hadoop 0.20.x, 0.23.x. Hive is commonly used in production Linux and Windows environment. Mac is a commonly used development environment. 5.2 Configuration Management Overview Hive by default gets its configuration from <install-dir>/conf/hivedefault.xml.the location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xmllog4j configuration is stored in <installdir>/conf/hive-log4j.properties. Hive configuration is an overlay on top of Hadoop it inherits the Hadoop configuration variables by default.hive configuration can be manipulated by:editing hive-site.xml and defining any desired variables (including Hadoop variables) in it.using the set command.invoking Hive (deprecated), Beeline or HiveServer2 using the syntax: $ bin/hive --hiveconf x1=y1 --hiveconf x2=y2 //this sets the variables x1 and x2 to y1 and y2 respectively $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2 //this sets server-side variables x1 and x2 to y1 and y2 respectively $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2 //this sets client-side variables x1 and x2 to y1 and y2 respectively. Setting the HIVE_OPTS environment variable to "-- hiveconf x1=y1 --hiveconf x2=y2" which does the same as above.hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:mapred.job.tracker While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can enable the following option: hive> SET mapred.job.tracker=local; In addition, mapred.local.dir should point to a path that's valid on the local machine(for example /tmp/<username>/mapred/local).(other wise, the user will get an exception allocating local disk space.) Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options Volume: 02 Issue:

6 hive.exec.mode.local.auto,hive.exec.mode.local.auto.inputbytes.max,and hive.exec. mode.local.auto.tasks.max: hive> SET hive.exec.mode.local.auto=false; Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:the total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)the total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default).the total number of reduce tasks required is 1 or 0.So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.note that there may be differences in the runtime environment of Hadoop server nodes and the machine running the Hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the Hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive. mapred. local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on a NFS file server for example). Execution logs are invaluable for debugging run-time errors. 5.4Runtime Configuration: Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.thehivecli (deprecated) and Beeline command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example: beeline>setmapred.job.tracker=myhost.mycompany.com: 50030; beeline> SET -v; The latter shows all the current settings. Without the - v option only the variables that differ from the base Hadoop configuration are displayed. 5.3 Hive Logging Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN for Hive releases prior to Starting with Hive , the default logging level is INFO./tmp/<user.name>/hive.log Note: In local mode, prior to Hive the log file name was ".log" instead of "hive.log". This bug was fixed in release (see HIVE-5528 and HIVE-5676). To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hivelog4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>). When local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6 Hive uses the hive-execlog4j.properties (falling back to hivelog4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. 6. CONCLUSION AND FUTURE WORK In this paper, experience of designing and implementing i3 MapReduce is discussed for iterative MapReduce computations. MapReduce extends the envelop of MapReduce to more classes of applications. We have also presented the results of a set of applications with voluminous data sets. Some of the benchmarks performed with this uses core cluster. The results, including some of the complex iterative applications such as MDS, indicate that i3mapreduce performs and scales well for many iterative MapReduce computations and even produces comparable results with classic parallel runtimes. In Future, this work is continuing on the functions like, (i)define a sound methodology for selecting centroids (ii)fine tune some parameters such as the decrease of the membership as a function of cost(iii) Improve the software implementation.(iv)new method to compute fuzzy membership(v)clusters based on density variation(vi)good results in difficult problems. REFERENCES [1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp , Volume: 02 Issue:

7 [2] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," presented at the Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal, [3] Bezdek J, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, USA, [4] Dunn JC, A fuzzy relative of the ISODATA process andits use in detecting compact, well separated clusters, Journal of Cybernetics, 3(3):32-57, [5] Dae-Won Kim, Kwang H. Lee, Doheon Lee, Fuzzy cluster validation index based on inter-cluster proximity. Pattern Recognition Letters, 24: , [6] Pal N.R, Pal K, Keller J.M. and Bezdek J.C, A Possibilistic Fuzzy c-means Clustering Algorithm, IEEE Transactions on Fuzzy Systems, Vol. 13, No. 4, Pp , [7] R. Krishnapuramamd J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Systems, Vol. 1, Pp , [8] VudaSreenivasarao and Dr.S. Vidyavathi, Comparative Analysis of Fuzzy C- Mean and Modified Fuzzy Possibilistic C -Mean Algorithms in Data Mining, IJCST Vol. 1, No. 1, Pp , [9] J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: [7] J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York. [10] Mohamed FadhelSaad and Adel M. Alimi, Modified Fuzzy Possibilistic C-means, Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 Vol I IMECS 2009, March 18-20, 2009, Hong Kong. [11] Apache Hadoop. Available at [12] Facebook Lexicon at [13] Hive wiki at Volume: 02 Issue: