I 3 Map Reduce for mining Big Data

Size: px
Start display at page:

Download "I 3 Map Reduce for mining Big Data"

Transcription

1 I 3 Map Reduce for mining Big Data P M Anupraba 1 S Parameswaran 2 1 (ME, Dept of CSE, Sree Sowdambika College Of Engg, Aruppukottai, India, [email protected]) 2 (Asst Prof, Dept of CSE, Sree Sowdambika College Of Engg, Aruppukottai, India, [email protected]) Abstract With the development of computer technology, there is a tremendous increase in the growth of data. A big problem has been encountered in various fields for making the full use of these large scale data which supports decision making. Data mining is the technique that can discovers new patterns from large data sets, where most classical data mining methods became out of reach in practice to handle such big data. Big data is large volume, heterogeneous (STRUCTURED, SEMI STRUCTURED & UN STRUCTURED) distributed data, which can be analyzed by MapReduce, a substructure of HDFS Platform. In this paper, a new model is proposed as i3 MapReduce, which acts as an efficient approach for refreshing materialized data by reducing re-computation process. It is based on a new Clustering algorithm with its features like,(i) Centralized- Environment, (ii) Membership id for each data,(iii) Data point in multiple cluster (iv) Easy handling of overlapped data-set. I assessed the final results according to the Online Experiment, using Amazon s Public Cloud and performance improvement Analysis tabulated between i3-mapreduce and other MapReduce. Keywords MapReduce; Fuzzy Clustering; Big Data; Public Cloud 1. INTRODUCTION Big Data became a big topic across nearly every area of IT. IDC defines Big Data technologies as a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics. Then there are the products and services that can be wrapped around one or all of these Big Data elements. The digital universe itself, of course, comprises data all kinds of data. However, the vast majority of new data being generated is unstructured. This means that more often than not, we know little about the data, unless it is somehow characterized or tagged a practice that results in metadata. Metadata is one of the fastest-growing sub segments of the digital universe (though metadata itself is a small part of the digital universe overall). We believe that by 2020, a third of the data in the digital universe (more than 13,000 hexa bytes) will have Big Data value, but only if it is tagged and analysed. Big data and analytics offer the promise to transform risk management and decisionmaking, providing more information and more speed. They won t solve every problem, however, and with these new sources of information come new pressures to focus risk management activities and respond quickly to perceived dangers. Using big data, companies have the potential to better identify hidden risk and allow better root cause analysis. Risk managers can improve their ability to determine the probability of an event by leveraging metadata and using customer segmentation to identify risk factors. Big data can help develop better early warning indicators that will allow companies to mitigate risk more effectively. 2. EASE OF USE A Map Reduce program is composed of a Map function and a Reduce function[1], as shown in Fig.1.TheMap function takes a kv-pair<k1;v1>as input and computes zero or more intermediate kvpairs<k2;v2>.thenall<k2;v2>aregroupedbyk2.the Reduce function takes a K2 and a list of {V2} as input and computes the final output kv-pairs<k3;v3>. A MapReduce system(e.g.,apachehadoop) usually reads the input data of the MapReduce computation from and writes the final results to a distributed file system (e.g., HDFS), which divides a file into equal-sized (e.g.,64mb)blocks and stores the blocks across a cluster of machines. For a Map Reduce program, the MapReduce system runs a Job Tracker process on a master node to monitor the job progress, and a set of Task Tracker processes on worker nodesnto perform the actual Map and Reduce tasks. The Job Tracker starts a Map task per data block, and typically assigns it to the Task Tracker on the machine that holds the corresponding data block inorder to minimize communication overhead. Each Maptask calls the Map function for every input, and stores the intermediate kvpairson local disks. Intermediate results are shuffled to Reduce tasks according to a partition function(e.g., a hash function)on K2. After a Reduce task obtains and merges intermediate results from all Map Tasks and produces the final output results. Fig 1. MapReduce Implementation of KV-Pairs Volume: 02 Issue:

2 3. APACHE HIVE Hive is a component of Hortonworks Data Platform (HDP). Hive provides a SQL-like interface to data stored in HDP. The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data. and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. 3.1 HBase Apache HBase is the Hadoop database modeled after Google s BigTable. It is a sparse, distributed, persistent multi- dimensional sorted map. The map is indexed by a row key, column key, and a timestamp. Each value in the map is an un-interpreted array of bytes. Logical View 3.2 Hive + HBase Integration For Input/Output Format, getsplits(), etc underlying HBase classes are used. Column selection and certain filters can be pushed down. HBase tables can be used with other(hadoop native) tables and SQL constructs. Hive DDL operations are converted to HBase DDL operations via the client hook. All operations are performed by the client No two phase commit. 3.3 Schema Mapping Hive table + columns + column types <=>HBase table + column families (+ column qualifiers). Every field in Hive table is mapped in order to either The table key (using :key as selector) A column family (cf:) -> MAP fields in Hive A column (cf:cq) Hive table does not need to include all columns in HBase. 3.4 Type Mapping Recently added to Hive Previously all types were being converted to strings in HBase. Hive has: Primitive types: INT, STRING, BINARY, DATE, etc ARRAY MAP STRUCT HBase does not have types Bytes.toBytes() Table level property "hbase.table.default.storage.type = binary. Type mapping can be given per column after # 3.5 Bulk Load: Steps to bulk load: Sample source data for range partitioning Save sampling results to a file Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner Import Hfiles into HBase table Ideal setup should be SET hive.hbase.bulk=true INSERT OVERWRITE TABLE web_table SELECT. 3.6 Filter Pushdown Idea is to pass down filter expressions to the storage layer to minimize scanned data. To access indexes at HDFS or HBase. Example: CREATE EXTERNAL TABLE users (userid LONG, STRING, ) STORED BY 'org.apache.hadoop.hive.hbase.hbasestoragehandler WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, ") SELECT... FROM users WHERE userid> and LIKE %@gmail.com ; - >scan.setstartrow(bytes.tobytes( )) 3.7 Filter Decomposition Optimizer pushes down the predicates to the query plan. Storage handlers can negotiate with the Hive optimizer to decompose the filter. Apache Hive Architecture Volume: 02 Issue:

3 x> 3 AND upper(y) = 'XYZ Handle x > 3, send upper(y) = XYZ as residual for Hive. Works with:key = 3, key > 3, etc key > 3 AND key < 100.Only works against constant expressions. 4. ITERATIVE COMPUTATION Iteration in computing is the repetition of a block of statements within a computer program. It can be used both as a general term, synonymous with repetition, and to describe a specific form of repetition with a mutable state. Confusingly, it may also refer to any repetition stated using an explicit repetition structure, regardless of mutability. It is applied using some algorithms like PageRank, GIM- V,Clustering etc. 4.1 PageRank: PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.it computes a ranking score for each vertex in a graph. After initializing all ranking scores, the computation performs a MapReduce job per iteration, as shown in Algorithm 2. i and j are vertex ids, Ni is the set of out-neighbor vertices of i, Ri is i s ranking score that is updated iteratively. j means concatenation. All Ri s are initialized to one.2 The Map instance on vertex i sends value Ri;j ¼ Ri=jNij to all its out-neighbors j, where jnij is the number of i s outneighbors. The Reduce instance on vertex j updates Rj by summing the Ri;j received from all its in-neighborsi, and applying a damping factor d. Algorithm 1. PageRank in MapReduce 4.2 GIM-V: Generalized Iterated Matrix-Vector multiplication (GIM-V) is an abstraction of many iterative graph mining operations (e.g., PageRank, spectral clustering, diameter estimation, connected components). These graph mining algorithms can be generally represented by operating on an nn matrix M and a vector v of size n. Suppose both the matrix and the vector are divided into sub-blocks. Let mi;j denote the ði;jþth block of M and vj denote the jth block of v. The computation steps are similar to those of the matrixvector multiplication and can be abstracted into three operations: (1) mvi;j = combine2(mi;j;v j); (2) v0 i = combinealli({mvi;j}); and (3) vi = assign(vi;v 0 i). We can compare combine2 to the multiplication between mi;j and vj, and compare combineall to the sum of mvi;j for row i. Algorithm 4 shows the MapReduce implementation with two jobs for each iteration. The first job assigns vector block vj to multiple matrix blocks mi;j (8i) and performs combine2(mi;j;v j) to obtain mvi;j. The second job groups the mvi;j and vi on the same i, performs the combineall ({mvi;j}) operation, and updates vi using assign(vi;v 0 i). Algorithm 2.GIM-V in MapReduce 4.3 Fuzzy C-Means Clustering : Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. Fixed number of clusters. Each data point belongs to the cluster corresponding to the closest centroid. Clusters are fuzzy sets. Membership degree of a point can be any number between 0 and 1.Sum of all degrees for a point must add up to Reasoning with Fuzzy sets: There are two assumptions that are essential for the use of formal set theory: For any element and a set belonging to some universe, the element is either a member of the set or else it is a member of the complement of that set. An element cannot belong to both a set and also to its complement. Clustering involves the task of dividing data points into homogeneous classes or clusters so that items in the same class are as similar as possible and items in different classes are as dissimilar as possible. Clustering can also be thought of as a form of data compression, where a large number of samples are converted into a small number of representative prototypes or clusters. Depending on the data and the application, different types of similarity measures may be used to identify classes, where the similarity measure controls how the clusters are formed. Some examples of values that can be used as similarity measures include distance, connectivity, and intensity.in Volume: 02 Issue:

4 non-fuzzy or hard clustering, data is divided into crisp clusters, where each data point belongs to exactly one cluster. Membership Matrix M This iteration will stop when, where is a termination criterion between 0 and 1, whereas k are the iteration steps. This procedure converges to a local minimum or a saddle point ofj m. The algorithm is composed of the following steps: Algorithm: FCM-Algorithm In fuzzy clustering, the data points can belong to more than one cluster, and associated with each of the points are membership grades which indicate the degree to which the data points belong to the different clusters. This chapter demonstrates the fuzzy c-means clustering algorithm.the FCM algorithm attempts to partition a finite collection of elements X={,,..., } into a collection of c fuzzy clusters with respect to some given criterion. Given a finite set of data, the algorithm returns a list of c cluster centers V, such that V =, i =1, 2,..., c and a partition matrix U such that U =, i =1,..., c, j =1,..., n, where is a numerical value in[0, 1] that tells the degree to which the element belongs to the i-th cluster. Fuzzy Clustering Looking at the picture, we may identify two clusters in proximity of the two data concentrations. We will refer to them using A and B. In the first approach shown in this tutorial - the k-means algorithm - we associated each datum to a specific centroid; therefore, this membership function looked like this: Objective function: with the update of membership u ij and the cluster centers c j by:, In the FCM approach, instead, the same given datum does not belong exclusively to a well defined cluster, but it can be placed in a middle way. In this case, the membership function follows a smoother line to indicate that every datum may belong to several clusters with different values of the membership coefficient., Volume: 02 Issue:

5 In the figure above, the datum shown as a red marked spot belongs more to the B cluster rather than the A cluster. The value 0.2 of m indicates the degree of membership to A for such datum. Now, instead of using a graphical representation, we introduce a matrix U whose factors are the ones taken from the membership functions: (a) (b) The number of rows and columns depends on how many data and clusters we are considering. More exactly we have C = 2 columns (C = 2 clusters) and N rows, where C is the total number of clusters and N is the total number of data. The generic element is so indicated: u ij. In the examples above we have considered the k-means (a) and FCM (b) cases. We can notice that in the first case (a) the coefficients are always unitary. It is so to indicate the fact that each datum can belong only to one cluster. Other properties are shown below: 5. CONFIGURATION Install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that. 5.1 Requirements Java 1.7 Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE- 8607). Hadoop 2.x (preferred), 1.x (not supported by Hive onward). Hive versions up to 0.13 also supported Hadoop 0.20.x, 0.23.x. Hive is commonly used in production Linux and Windows environment. Mac is a commonly used development environment. 5.2 Configuration Management Overview Hive by default gets its configuration from <install-dir>/conf/hivedefault.xml.the location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xmllog4j configuration is stored in <installdir>/conf/hive-log4j.properties. Hive configuration is an overlay on top of Hadoop it inherits the Hadoop configuration variables by default.hive configuration can be manipulated by:editing hive-site.xml and defining any desired variables (including Hadoop variables) in it.using the set command.invoking Hive (deprecated), Beeline or HiveServer2 using the syntax: $ bin/hive --hiveconf x1=y1 --hiveconf x2=y2 //this sets the variables x1 and x2 to y1 and y2 respectively $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2 //this sets server-side variables x1 and x2 to y1 and y2 respectively $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2 //this sets client-side variables x1 and x2 to y1 and y2 respectively. Setting the HIVE_OPTS environment variable to "-- hiveconf x1=y1 --hiveconf x2=y2" which does the same as above.hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:mapred.job.tracker While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can enable the following option: hive> SET mapred.job.tracker=local; In addition, mapred.local.dir should point to a path that's valid on the local machine(for example /tmp/<username>/mapred/local).(other wise, the user will get an exception allocating local disk space.) Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options Volume: 02 Issue:

6 hive.exec.mode.local.auto,hive.exec.mode.local.auto.inputbytes.max,and hive.exec. mode.local.auto.tasks.max: hive> SET hive.exec.mode.local.auto=false; Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:the total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)the total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default).the total number of reduce tasks required is 1 or 0.So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.note that there may be differences in the runtime environment of Hadoop server nodes and the machine running the Hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the Hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive. mapred. local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on a NFS file server for example). Execution logs are invaluable for debugging run-time errors. 5.4Runtime Configuration: Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.thehivecli (deprecated) and Beeline command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example: beeline>setmapred.job.tracker=myhost.mycompany.com: 50030; beeline> SET -v; The latter shows all the current settings. Without the - v option only the variables that differ from the base Hadoop configuration are displayed. 5.3 Hive Logging Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN for Hive releases prior to Starting with Hive , the default logging level is INFO./tmp/<user.name>/hive.log Note: In local mode, prior to Hive the log file name was ".log" instead of "hive.log". This bug was fixed in release (see HIVE-5528 and HIVE-5676). To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hivelog4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>). When local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6 Hive uses the hive-execlog4j.properties (falling back to hivelog4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. 6. CONCLUSION AND FUTURE WORK In this paper, experience of designing and implementing i3 MapReduce is discussed for iterative MapReduce computations. MapReduce extends the envelop of MapReduce to more classes of applications. We have also presented the results of a set of applications with voluminous data sets. Some of the benchmarks performed with this uses core cluster. The results, including some of the complex iterative applications such as MDS, indicate that i3mapreduce performs and scales well for many iterative MapReduce computations and even produces comparable results with classic parallel runtimes. In Future, this work is continuing on the functions like, (i)define a sound methodology for selecting centroids (ii)fine tune some parameters such as the decrease of the membership as a function of cost(iii) Improve the software implementation.(iv)new method to compute fuzzy membership(v)clusters based on density variation(vi)good results in difficult problems. REFERENCES [1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp , Volume: 02 Issue:

7 [2] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," presented at the Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal, [3] Bezdek J, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, USA, [4] Dunn JC, A fuzzy relative of the ISODATA process andits use in detecting compact, well separated clusters, Journal of Cybernetics, 3(3):32-57, [5] Dae-Won Kim, Kwang H. Lee, Doheon Lee, Fuzzy cluster validation index based on inter-cluster proximity. Pattern Recognition Letters, 24: , [6] Pal N.R, Pal K, Keller J.M. and Bezdek J.C, A Possibilistic Fuzzy c-means Clustering Algorithm, IEEE Transactions on Fuzzy Systems, Vol. 13, No. 4, Pp , [7] R. Krishnapuramamd J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Systems, Vol. 1, Pp , [8] VudaSreenivasarao and Dr.S. Vidyavathi, Comparative Analysis of Fuzzy C- Mean and Modified Fuzzy Possibilistic C -Mean Algorithms in Data Mining, IJCST Vol. 1, No. 1, Pp , [9] J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: [7] J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York. [10] Mohamed FadhelSaad and Adel M. Alimi, Modified Fuzzy Possibilistic C-means, Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 Vol I IMECS 2009, March 18-20, 2009, Hong Kong. [11] Apache Hadoop. Available at [12] Facebook Lexicon at [13] Hive wiki at Volume: 02 Issue:

Integration of Apache Hive and HBase

Integration of Apache Hive and HBase Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Page 1 About Me User and committer of Hadoop since 2007 Contributor to Apache Hadoop, HBase, Hive and Gora Joined

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1

How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 How to Install and Configure EBF15328 for MapR 4.0.1 or 4.0.2 with MapReduce v1 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Scaling Up HBase, Hive, Pegasus

Scaling Up HBase, Hive, Pegasus CSE 6242 A / CS 4803 DVA Mar 7, 2013 Scaling Up HBase, Hive, Pegasus Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Graph Processing and Social Networks

Graph Processing and Social Networks Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Using distributed technologies to analyze Big Data

Using distributed technologies to analyze Big Data Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12 Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

American International Journal of Research in Science, Technology, Engineering & Mathematics

American International Journal of Research in Science, Technology, Engineering & Mathematics American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Efficient Processing of XML Documents in Hadoop Map Reduce

Efficient Processing of XML Documents in Hadoop Map Reduce Efficient Processing of Documents in Hadoop Map Reduce Dmitry Vasilenko, Mahesh Kurapati Business Analytics IBM Chicago, USA [email protected], [email protected] Abstract has dominated the enterprise

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

CASE STUDY OF HIVE USING HADOOP 1

CASE STUDY OF HIVE USING HADOOP 1 CASE STUDY OF HIVE USING HADOOP 1 Sai Prasad Potharaju, 2 Shanmuk Srinivas A, 3 Ravi Kumar Tirandasu 1,2,3 SRES COE,Department of er Engineering, Kopargaon,Maharashtra, India 1 [email protected]

More information

A Study of Data Management Technology for Handling Big Data

A Study of Data Management Technology for Handling Big Data Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

Architecting the Future of Big Data

Architecting the Future of Big Data Hive ODBC Driver User Guide Revised: July 22, 2013 2012-2013 Hortonworks Inc. All Rights Reserved. Parts of this Program and Documentation include proprietary software and content that is copyrighted and

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]

More information

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385 brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and

More information

Policy-based Pre-Processing in Hadoop

Policy-based Pre-Processing in Hadoop Policy-based Pre-Processing in Hadoop Yi Cheng, Christian Schaefer Ericsson Research Stockholm, Sweden [email protected], [email protected] Abstract While big data analytics provides

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Creating a universe on Hive with Hortonworks HDP 2.0

Creating a universe on Hive with Hortonworks HDP 2.0 Creating a universe on Hive with Hortonworks HDP 2.0 Learn how to create an SAP BusinessObjects Universe on top of Apache Hive 2 using the Hortonworks HDP 2.0 distribution Author(s): Company: Ajay Singh

More information

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1 Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG 1 RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG Background 2 Hive is a data warehouse system for Hadoop that facilitates

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Data Weather Analytics Using Hadoop

Big Data Weather Analytics Using Hadoop Big Data Weather Analytics Using Hadoop Veershetty Dagade #1 Mahesh Lagali #2 Supriya Avadhani #3 Priya Kalekar #4 Professor, Computer science and Engineering Department, Jain College of Engineering, Belgaum,

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Hadoop Technology for Flow Analysis of the Internet Traffic

Hadoop Technology for Flow Analysis of the Internet Traffic Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Distributed R for Big Data

Distributed R for Big Data Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Ian Foster Computation Institute Argonne National Lab & University of Chicago 2 3 SQL Overview Structured Query Language

More information

Estimating PageRank Values of Wikipedia Articles using MapReduce

Estimating PageRank Values of Wikipedia Articles using MapReduce Estimating PageRank Values of Wikipedia Articles using MapReduce Due: Sept. 30 Wednesday 5:00PM Submission: via Canvas, individual submission Instructor: Sangmi Pallickara Web page: http://www.cs.colostate.edu/~cs535/assignments.html

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

docs.hortonworks.com

docs.hortonworks.com docs.hortonworks.com Hortonworks Data Platform: Hive Performance Tuning Copyright 2012-2015 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Data storing and data access

Data storing and data access Data storing and data access Plan Basic Java API for HBase demo Bulk data loading Hands-on Distributed storage for user files SQL on nosql Summary Basic Java API for HBase import org.apache.hadoop.hbase.*

More information

Copyright. Albert Haque

Copyright. Albert Haque Copyright by Albert Haque 2013 The Thesis Committee for Albert Haque Certifies that this is the approved version of the following thesis: A MapReduce Approach to NoSQL RDF Databases Approved By Supervising

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Hadoop MapReduce using Cache for Big Data Processing

Hadoop MapReduce using Cache for Big Data Processing Hadoop MapReduce using Cache for Big Data Processing Janani.J Dept. of computer science and engineering Arasu Engineering College Kumbakonam Kalaivani.K Dept. of computer science and engineering Arasu

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Fuzzy Clustering Technique for Numerical and Categorical dataset

Fuzzy Clustering Technique for Numerical and Categorical dataset Fuzzy Clustering Technique for Numerical and Categorical dataset Revati Raman Dewangan, Lokesh Kumar Sharma, Ajaya Kumar Akasapu Dept. of Computer Science and Engg., CSVTU Bhilai(CG), Rungta College of

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Architecting the Future of Big Data

Architecting the Future of Big Data Hive ODBC Driver User Guide Revised: July 22, 2014 2012-2014 Hortonworks Inc. All Rights Reserved. Parts of this Program and Documentation include proprietary software and content that is copyrighted and

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information