Big Data Analysis Technology


Big Data Analysis Technology
Tobias Hardes
Seminar: Cloud Computing and Big Data Analysis, Summer semester 2013
University of Paderborn

Abstract: This paper introduces several big data processing techniques. Association rule mining is used to find relationships in large data sets; mining frequent patterns is an important technique for this purpose. The Apriori algorithm is a classic algorithm to detect frequent item sets and to generate rules using these item sets. Compared to the Apriori algorithm, the FP-Growth algorithm needs fewer scans of the database to extract frequent item sets and performs the mining even faster. Next to rule mining, cluster analysis is another technique for analyzing big data sets. Here the K-Means algorithm is the most common one; it minimizes the Euclidean distance between entities in the same cluster. Furthermore the K-Means++ algorithm will be discussed, which provides a seeding extension for K-Means. With the preprocessing of K-Means++, K-Means converges much faster and yields a solution that is O(log k)-competitive with the optimal K-Means solution. Finally Google's MapReduce will be discussed, a parallel and distributed computing framework for processing Big Data on a cluster with hundreds or thousands of computers.

Keywords: Big Data, Cloud Computing, Association rule mining, Apriori algorithm, FP-growth algorithm, frequent pattern, K-means, K-means++, MapReduce framework

I. INTRODUCTION

Business Intelligence has long been an important topic for the decision making and reporting of companies, but the circumstances have changed. Today it is possible to buy storage units for less than $600 that could hold all of the world's music [?]. The ability to store and work with data, and then to use the results for further analysis, becomes ever more accessible with trends such as Moore's Law [?]. Big Data is a term defining data that has three main characteristics. First, it involves a great volume of data. Second, the data cannot be structured into regular database tables because of its variety, and third, the data is produced with great velocity and must be captured and processed rapidly [?]. Some literature adds one more keyword, veracity [?]. The use of Big Data is needed in order to discover trends or hidden knowledge over a course of activities. In addition to transactional data, there is data from the web, e.g. social networks or sensor networks. Taking a closer look at changes in our environment during the last years, we should also mention radio-frequency identification (RFID) tags, geodata (GPS), data from satellites and medical applications, which all create a variety of data. Because of this high volume, it is not possible to insert such data into traditional databases, which already contain terabytes or petabytes of data. Furthermore the use of sensors often leads to a high velocity of new data: new information is created every second and might have to be processed in real time. With this comes the challenge of veracity, because there can be wrong values, which have to be detected. The main purpose of this paper is to present different techniques for processing Big Data. The remainder of this paper is organized as follows. Section II discusses some issues based on the research at CERN and shows some problems which are related to Big Data. Section III discusses advanced and related techniques of Big Data.
Section IV discusses techniques from the area of Data Mining, presenting the most important algorithms and methods to analyze large data sets. Finally, Section V concludes the aspects described in this paper.

II. BACKGROUND

In July 2012 the Large Hadron Collider (LHC) experiments ATLAS and CMS announced that they had each observed a new particle in the mass region around 126 GeV¹, also known as the Higgs boson [?]. The LHC is the world's largest and most powerful particle accelerator. To find the Higgs boson, particles collide within the LHC approximately 600 million times per second. Each collision generates particles that often decay in complex ways into even more particles. Sensors record the passage of each particle through a detector as a series of electronic signals and send the data to the CERN Data Centre for digital reconstruction. Physicists must sift through the approximately 15 petabytes of data produced annually to determine whether the collisions have thrown up any interesting physics [?]. The Worldwide LHC Computing Grid (WLCG) was launched in 2002 to address the issue of missing computational resources. The grid links thousands of computers in 140 centers across 45 countries. To date the WLCG is the largest computing grid in the world [?] and it runs more than one million jobs per day. At peak rates, 10 gigabytes of data may be transferred from its servers every second [?], [?]. Similar scenarios exist in the medical area, in governments and in the private sector (e.g. Amazon.com, Facebook). Dealing with large data sets becomes more and more important, because an accurate data basis is needed to face the problems of the named areas. The challenge is to perform complex analysis algorithms on Big Data to generate or to find new knowledge which the data contains. This new knowledge can be used to address important, temporal and daily problems such as the research at CERN, the analysis of the stock market or the analysis of company data which was gathered over a long period of time.

¹ Billion electron volts. The electron volt is a unit of energy equal to approximately 1.602 × 10⁻¹⁹ joule.

A. Just in time analysis

Often it is necessary to analyze data just in time, because it has temporal significance (e.g. stock market or sensor networks). For this it might be necessary to analyze data for a certain time period, e.g. the last 30 seconds. Such requirements can be addressed through the use of Data Stream Management Systems (DSMS). The corresponding software tools are called Complex Event Processing (CEP) engines, and the queries are written in declarative languages such as the Event Processing Language (EPL) of Esper [?]. The syntax is similar to SQL for databases.

Listing 1. Sample EPL command
select avg(price) from StockTickEvent.win:time(30 sec)

As shown in the example above, EPL supports a window over data streams, which can be used to buffer events during a defined time. This technique is used for data stream analysis. The window can move or slide in time; Figure 1 shows the difference between the two variants. If a window moves by as much as the window size, this type of window is a tumbling window. The other type is named sliding window: this window slides in time and buffers the last x elements.
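To make the two window types concrete, the following minimal Python sketch imitates the buffering behaviour of Listing 1 outside of any CEP engine. The class names, the explicit timestamps and the 30-second period are illustrative assumptions, not Esper APIs.

from collections import deque

class SlidingTimeWindow:
    """Keeps the events of the last `size` time units; the buffer slides with each event."""
    def __init__(self, size):
        self.size, self.events = size, deque()       # (timestamp, value) pairs

    def add(self, ts, value):
        self.events.append((ts, value))
        while self.events[0][0] <= ts - self.size:    # evict events that slid out
            self.events.popleft()

    def avg(self):
        return sum(v for _, v in self.events) / len(self.events)

class TumblingTimeWindow:
    """Buffers events and emits one aggregate per window, then starts empty again."""
    def __init__(self, size):
        self.size, self.start, self.values = size, None, []

    def add(self, ts, value):
        if self.start is None:
            self.start = ts
        result = None
        if ts - self.start >= self.size:              # window closed: emit and reset
            result = sum(self.values) / len(self.values)
            self.start, self.values = ts, []
        self.values.append(value)
        return result                                 # None while the window fills up

# Average stock price over 30-second windows.
events = [(0, 10.0), (10, 12.0), (25, 11.0), (40, 13.0)]
sliding, tumbling = SlidingTimeWindow(30), TumblingTimeWindow(30)
for ts, price in events:
    sliding.add(ts, price)
    closed = tumbling.add(ts, price)
    if closed is not None:
        print("tumbling:", closed)   # 11.0: average of the first, now closed, window
print("sliding:", sliding.avg())     # 12.0: only the events at ts=25 and ts=40 remain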
Fig. 1. Windows - Based on [?]

The size of the window can be set by the user within the select command. Note that smaller windows contain less data to compare, so they often result in high false positive rates. If a larger window is used, this effect can be compensated [?]. When using a CEP engine, the computation happens on the fly. Because of this, the complexity of the analysis cannot be very high, so it might not be possible to detect the Higgs boson using data stream analysis alone; the analysis is too complex. Therefore, further mining algorithms on the full data set are required. Taking a look at the Twitter service, it might be possible to use CEP to analyze the stream of tweets to detect emotional states or something similar. This example shows some additional aspects of Big Data analysis with respect to fault tolerance. Consider the following tweet: "After a whole 5 hours away from work, I get to go back again, I'm so lucky!" This tweet contains sarcasm, and it is a complex task to detect this. A way to solve this problem is to collect training data and to apply it to an appropriate learning algorithm [?].

B. Rule mining and clustering

A more advanced technique is called association rule mining. Association rule mining searches for relationships between items in data sets, but it can also be implemented for analyzing streams. For this, there are association rules which are used by algorithms to find associations, correlations or causal structures. Chapters IV-A2 and IV-A3 discuss this technique in more detail. Cluster analysis is the task of grouping similar objects into classes. The classes are defined during the clustering; it is not a mapping to pre-defined classes [?]. Today, cluster analysis techniques are used in different areas, such as data compression, process monitoring, DNA analysis, finance, radar scanning and further research areas. In all of these areas huge amounts of data are stored. A clustering algorithm can be hierarchical or partitional [?].

III. RELATED WORK

Big Data became a very important topic during the last years. It is still an important area for research, but it has also reached business. Various companies are working in the field of Big Data, whether in research or consulting: McKinsey & Company [?], IBM Corporation [?], Fujitsu Limited [?], Intel Corporation [?] and many more. This chapter discusses some further developments to improve the algorithms and methods that are usually used to process Big Data. In addition, alternative solutions are presented that can be used for analyzing or processing Big Data.

A. Basic principles

There are different techniques which allow the analysis of large data sets. Not all techniques strictly require the use of Big Data, but they can be applied to this area. Most algorithms for large item sets are related to the Apriori algorithm that will be discussed in Chapter IV-A2. All algorithms and methods are usually based on the same basic principles: statistics, probability theory and machine learning in terms of a system that can learn from data. The biggest such area is called Data Mining. Data Mining can be described as a set of techniques which can be used to extract patterns from small as well as big data sets [?]. In the field of Data Mining the most important techniques are association rule learning, cluster analysis and classification.

B. Improvements

Today the performance of such algorithms becomes more and more important, since analyzing large data sets results in high computational costs. In recent years much progress has been made in the following directions [?]:

- Selection of the startup parameters for the algorithms
- Reducing the number of passes over the database
- Sampling the database
- Adding extra constraints on the structure of patterns
- Parallelization

Chapter IV-B1 discusses the K-means method, which is used for clustering. The classic implementation has an exponential worst-case runtime of 2^Ω(n) [?]. In [?], Arthur et al. published an alternative implementation which leads to a lower runtime by improving the start parameters of the algorithm. Other implementations pursue an optimization approach to reduce the runtime. Many variations of the Apriori algorithm follow the same idea; they are discussed in [?]. Many of these variations are concerned with the generation of a temporary data set, because this part has the greatest computational cost. In [?], Yang Kai et al. discuss a new algorithm called FA-DMFI that is used for discovering the maximum frequent item sets. For this, the algorithm stores association information by scanning the database only once. The temporary data set is then stored in an association matrix. In summary, the efficiency is achieved in two steps:

1) The huge database is compressed into a smaller data structure, which avoids repeated and costly database scans.
2) The maximum frequent item set is ultimately generated by means of the cover relation in this information matrix, and the costly generation of a large number of candidates is avoided.

With these steps the I/O time and CPU time are reduced, and the generation of the candidate item sets and the frequent item sets can be done in less time. The Apriori algorithm, which is discussed in Chapter IV-A2, spends a lot of time generating the frequent item sets, and it needs to scan the data source multiple times. Using some implementation details of the FA-DMFI algorithm could also improve the runtime of the Apriori algorithm.

MapReduce is a distributed computing technology that was developed by Google. The programming model allows developers to implement custom mapper and reducer functions programmatically and to run batch processes on hundreds or thousands of computers. Chapter IV-C discusses this method in more detail. Based on this technology Google developed a web service called Google BigQuery which can be used for analyzing large data sets. BigQuery is based on Dremel, a query service that allows running SQL-like queries against very large data sets and getting accurate results in seconds [?]. So the data set can be accessed and queried, but BigQuery is not meant to execute complex mining. Usually the queries are issued from Microsoft Office or the Google Chrome browser. Figure 2 gives an example of the BigQuery workflow.

Fig. 2. Google BigQuery workflow: upload the data set to Google Storage, import the data into tables, run queries

With this approach Google tries to offer a suitable system for OLAP (Online Analytical Processing) or BI (Business Intelligence) using Big Data. Here, most queries are quite simple and done with simple aggregations or filtering using columns. Because of this, BigQuery is not a programming model for deep analysis of Big Data. It is a query service with functionality to perform a quick analysis, like aggregations of data, and there is no possibility to run user code [?].
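Unlike BigQuery, the MapReduce model is programmable: the user supplies the mapper and the reducer. The following single-machine Python sketch imitates the phases of the model on a word-count task; the function names and the task itself are illustrative assumptions, and a real cluster would distribute the map and reduce calls and shuffle the intermediate pairs over the network.

from collections import defaultdict
from itertools import chain

def mapper(document):
    """map(k, v): emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield word.lower(), 1

def reducer(word, counts):
    """reduce(k, [v]): aggregate all intermediate values of one key."""
    return word, sum(counts)

def run_mapreduce(documents):
    # Map phase: apply the mapper to every input record.
    intermediate = chain.from_iterable(mapper(d) for d in documents)
    # Shuffle phase: group intermediate pairs by key (done by the framework).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply the reducer to each group.
    return dict(reducer(k, vs) for k, vs in groups.items())

print(run_mapreduce(["big data analysis", "big data mining"]))
# {'big': 2, 'data': 2, 'analysis': 1, 'mining': 1}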
C. Distributed computing

Often the computational tasks are very complex, and the data needed for an individual task amounts to several terabytes or even more. It is then not possible to compute these tasks on a single machine, because there are not enough hardware resources. These are the reasons why the parallelization of computational tasks becomes more and more important. In [?], Wang et al. present a development platform based on virtual clusters. Virtual clusters are a kind of virtual machine infrastructure provided by cloud computing. The platform presented in [?] provides techniques for parallel data mining based on these virtual clusters. To measure the performance of the system, Wang et al. used a parallelized PSO (Particle Swarm Optimization) algorithm. The platform reduces the computing time of the swarm and improves the cluster quality. To use the advantages of a parallel architecture, a corresponding software architecture is required.

D. Big Data and Cloud Computing

IV. MAIN APPROACHES

This section analyses and describes the implementation details and practical issues of Big Data analysis techniques. The focus is on algorithms for association rule mining and cluster analysis. The last part discusses a technique for distributed and parallel analysis of large data sets. Subsection IV-A2 discusses the Apriori algorithm, Subsection IV-A3 the Frequent Pattern (FP)-Growth algorithm and Subsection IV-B1 the K-means and K-means++ algorithms. Finally, Subsection IV-C discusses the MapReduce framework.

A. Associated rule mining

Association mining can be used to identify items that are related to other items. A concrete daily life example could be the analysis of baskets in an online shop or even in the supermarket. Because each customer purchases different combinations of products at different times, there are questions that can be addressed with association rule mining: Why do customers buy certain products? Who are the customers (students, families, ...)? Which products might benefit from advertising? The calculated rules can be used to restructure the market or to send special promotions to certain types of customers. This chapter discusses the Apriori and the FP-Growth algorithms. Both algorithms often provide the basis for further specialized algorithms. A common strategy, which is used by association rule mining algorithms, is to split the problem into two tasks:

1) Generation of frequent item sets: The task is to find all item sets that satisfy a minimum support, which is usually defined by the user.
2) Generation of rules: The task is to find high-confidence rules using the frequent item sets.

The Apriori algorithm (see Chapter IV-A2) and the FP-Growth algorithm (see Chapter IV-A3) are both based on this principle.

1) Terminology: The following terminology has been taken from [?], [?], [?]. These definitions were adapted in order to discuss the algorithms of this chapter more accurately. Let S be a stream or a database with n elements. Then I = {i_1, i_2, ..., i_n} is an item set with n items. So an item set could be a basket in a supermarket containing n elements. Furthermore an association rule is of the form A ⇒ B where A ⊂ I, B ⊂ I and A ∩ B = ∅. So the rule mining can extract a trend such as "when stocks of credit institutes go down, companies for banking equipment follow". There are no restrictions on the number of items which are used for such a rule [?]. The Support of an item set is the percentage of records which contain the specific item set. Assume that φ is the frequency of occurrence of an item set. Then the support of an item set A is:

sup(A) = φ(A) / |S|

The Confidence is calculated as the relative percentage of records which contain both A and B. It determines how frequently items in B appear in transactions that contain A:

conf(A ⇒ B) = sup(A ∪ B) / sup(A)

Additionally, the frequency of an item set is defined as the number of transactions in S that contain the item set I. L_k denotes the frequent item sets of length k [?]. The example at the end of this section shows a daily situation where association rule mining is used.

2) Apriori algorithm: This section discusses the Apriori algorithm, which is one of the most common algorithms for association rule mining. The Apriori algorithm uses prior knowledge to mine frequent item sets in a database or stream. It is a level-wise iterative search algorithm, where the item sets of length k are used to analyze the item sets of length k + 1. The algorithm is based on a simple principle: if an item set is frequent, then all subsets of this item set are also frequent. To use the algorithm, the user specifies the minimum support (min_sup), the data source (S) and the minimum confidence (min_conf) (see Chapter IV-A1 for terminology). The first task is to extract frequent item sets.

Extract frequent item sets: Before the iteration starts, the algorithm scans the database to determine the count of each item; an item has to satisfy the min_sup input. Figure 4 shows the pseudocode for the implementation of the Apriori algorithm. Figure 3 shows the corresponding UML (Unified Modeling Language) activity diagram.

Fig. 3. Activity diagram - Apriori frequent item set generation - Notation: [?]
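Before turning to the pseudocode of Figure 4, the two measures defined above can be stated as a short Python sketch. The transaction sample is illustrative, and the handling of min_sup and min_conf is omitted.

def support(itemset, transactions):
    """sup(A) = φ(A) / |S|: the fraction of transactions containing all of A."""
    phi = sum(1 for t in transactions if set(itemset) <= t)
    return phi / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A ⇒ B) = sup(A ∪ B) / sup(A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# Illustrative mini data set in the spirit of Table I further below.
S = [{"bread", "butter"}, {"milk", "bread", "butter"}, {"pasta", "milk"}]
print(support({"bread", "butter"}, S))       # 2/3
print(confidence({"bread"}, {"butter"}, S))  # 1.0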

1:  procedure AprioriFrequentItemsets(min_sup, S)
2:      L_1 ← {frequent 1-item sets}
3:      for (k = 2; L_{k-1} ≠ ∅; k++) do
4:          C_k ← apriori-gen(L_{k-1})          ▷ Create the candidates
5:          for each c ∈ C_k do
6:              c.count ← 0
7:          end for
8:          for each transaction I ∈ S do
9:              C_r ← subset(C_k, I)            ▷ Identify candidates that belong to I
10:             for each c ∈ C_r do
11:                 c.count++                   ▷ Counting the support values
12:             end for
13:         end for
14:         L_k ← {c ∈ C_k | c.count ≥ min_sup} ▷ Keep candidates that satisfy min_sup
15:     end for
16:     return ∪_k L_k
17: end procedure

Fig. 4. Apriori algorithm - Based on [?], [?]

The algorithm outputs the set of item sets I with support(I) ≥ min_sup. The function apriori-gen in line 4 is the part where the set of candidates is created. It takes L_{k-1} as an argument and is performed in two steps:

1) Join: Generation of a new candidate item set C_k.
2) Prune: Elimination of item sets with support(I_j) < min_sup.

For this it is assumed that the items are ordered, e.g. alphabetically. Assume that there are distinct pairs of sets in L_{k-1} = {{a_1, a_2, ..., a_{k-1}}, {b_1, b_2, ..., b_{k-1}}, ..., {n_1, n_2, ..., n_{k-1}}} with a_i ≤ b_i ≤ ... ≤ n_i, 1 ≤ i ≤ k-1. Then the join step joins all (k-1)-item sets that differ by only the last element. The following SQL statement shows a way to select those items:

select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Lines 8 to 13 (Figure 4) are used for the support counting. This is done by comparing each transaction I with the candidates C_k (line 9) and incrementing their support counts. This is a very expensive task, especially if the number of transactions and candidates is very large. The result of the statement is added to the set C_k. Now C_k is used to proceed with the prune step. In the prune step, the set L_k is generated from the candidates in C_k by pruning those combinations of items which cannot satisfy the min_sup value (line 14). After the for-loop in line 3 terminates, the algorithm outputs the set of item sets whose support is at least min_sup [?], [?], [?]. With this it is possible to generate association rules.

Generation of association rules: The generation of the association rules is similar to the generation of the frequent item sets and is based on the same principle. Figures 5 and 6 show the pseudocode.

1: procedure AprioriRuleGen(min_conf, S)
2:     for each item set L_k do
3:         H_1 ← {i | i ∈ L_k}                  ▷ Rules with a 1-item consequent
4:         GenRules(L_k, H_1, min_conf)
5:     end for
6: end procedure

Fig. 5. Apriori rule generation - Based on [?]

1: procedure GenRules(L_k, H_m, min_conf)
2:     if k > m + 1 then
3:         H_{m+1} ← apriori-gen(H_m)
4:         for each h_{m+1} ∈ H_{m+1} do
5:             conf ← Support(L_k) / Support(L_k − h_{m+1})
6:             if conf ≥ min_conf then
7:                 output (L_k − h_{m+1}) ⇒ h_{m+1}
8:             else
9:                 H_{m+1} ← H_{m+1} \ {h_{m+1}}
10:            end if
11:        end for
12:        GenRules(L_k, H_{m+1}, min_conf)
13:    end if
14: end procedure

Fig. 6. Apriori GenRules - Based on [?]

First, all confidence rules with a single-item consequent are extracted using the frequent item sets. These rules are used to generate new candidate rules by merging their consequents. For example, if the rules {a, b, c} ⇒ {d} and {a, b, d} ⇒ {c} satisfy the minimum confidence, then there will be a candidate rule {a, b} ⇒ {c, d}, obtained by merging the consequents of both rules. The new rule has to satisfy the min_conf value, which was specified by the user. The generation of such rules is done iteratively. The function apriori-gen (Figure 6, line 3) is used again to generate candidates for new rules.
Then the set of candidates is checked against the min_conf value (Figure 6, line 6). For this, the same kind of pruning as in the generation of the frequent item sets is used: if a rule does not satisfy the minimum confidence, then the rules derived from it by moving further items into the consequent cannot satisfy it either. Because of this, the rule can be removed (Figure 6, line 9).
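The frequent-item-set phase of Figure 4 can be condensed into the following self-contained Python sketch, under the assumption that apriori-gen is expressed as the join of (k-1)-item sets followed by the subset-based prune; the variable names are illustrative.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_sup):
    """Level-wise search as in Figure 4. Note: min_sup is an absolute
    transaction count here, while the running text uses percentages."""
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-item sets.
    L = [frozenset([i]) for i in items
         if sum(1 for t in transactions if i in t) >= min_sup]
    frequent = list(L)
    k = 2
    while L:
        # apriori-gen, join step: merge (k-1)-item sets that share all but one item.
        candidates = {a | b for a, b in combinations(L, 2) if len(a | b) == k}
        # apriori-gen, prune step: every (k-1)-subset of a candidate must be frequent.
        prev = set(L)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support counting, lines 8-13 of Figure 4.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = [c for c, n in counts.items() if n >= min_sup]  # line 14: keep by min_sup
        frequent.extend(L)
        k += 1
    return frequent

# The ten transactions of Figure 7 (further below) reused as input.
S = [{"a","b"}, {"b","c","d"}, {"a","c","d","e"}, {"a","d","e"}, {"a","b","c"},
     {"a","b","c","d"}, {"a"}, {"a","b","c"}, {"a","b","d"}, {"b","c","e"}]
print(apriori_frequent_itemsets(S, min_sup=2))
# All singletons, pairs such as {a, b}, and the triple {a, d, e}.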

Example - Association rule mining: Usually the algorithms for association rule mining are used for basket analysis in online stores. To give an example, assume the transactions of Table I.

TABLE I. TRANSACTIONS

Transaction | Items
T1 | {Coffee, Pasta, Milk}
T2 | {Pasta, Milk}
T3 | {Bread, Butter}
T4 | {Coffee, Milk, Butter}
T5 | {Milk, Bread, Butter}

Assume the rule Bread ⇒ Butter. To compute the confidence for this rule, the support of the item set is needed:

sup(Bread ⇒ Butter) = φ({Bread, Butter}) / |S| = 2/5 = 40%

Then the confidence for this rule is computed as follows:

conf(Bread ⇒ Butter) = sup(Bread ∪ Butter) / sup(Bread) = 40% / 40% = 100%

So the rule Bread ⇒ Butter has a confidence of 100%: if there is bread in a basket, the probability that the same basket also contains butter is 100%. Now assume the rule Butter ⇒ Milk. Again the support value has to be calculated:

sup(Butter ⇒ Milk) = φ({Butter, Milk}) / |S| = 2/5 = 40%

The calculation of the confidence is done as above:

conf(Butter ⇒ Milk) = sup(Butter ∪ Milk) / sup(Butter) = 40% / 60% ≈ 66%

So the rule Butter ⇒ Milk has a confidence of about 66%: if there is butter in a basket, the probability that the same basket also contains milk is about 66%.

Using such rules to identify trends can be critical, because a trend might not hold for long. If the streaming data outputs rules, the reaction of the data analysts is to increase the support and confidence values to obtain fewer rules with stronger confidence, but rules may have already become outdated by the end of that period. An example of this could be sales of ice cream during the hottest month of the summer: after the trend is gone, there is no sales opportunity left. Another example could be the stock market, which was already mentioned in Chapter II.

3) Frequent Pattern (FP)-Growth algorithm: The Frequent Pattern (FP)-Growth method is used with databases and not with streams. The Apriori algorithm needs n + 1 scans of the database, where n is the length of the longest pattern. By using the FP-Growth method, the number of scans of the entire database can be reduced to two. The algorithm extracts frequent item sets that can be used to extract association rules. This is done using the support of an item set. The terminology used for this algorithm is described in Chapter IV-A1. The main idea of the algorithm is to use a divide and conquer strategy: compress the database which provides the frequent sets, then divide this compressed database into a set of conditional databases, each associated with a frequent set, and apply data mining on each such database [?]. To compress the data source, a special data structure called the FP-Tree is needed [?]. The tree is used for the data mining part. Altogether the algorithm works in two steps:

1) Construction of the FP-Tree
2) Extraction of frequent item sets

a) Construction of the FP-Tree: The FP-Tree is a compressed representation of the input. While reading the data source, each transaction t is mapped to a path in the FP-Tree. As different transactions can have several items in common, their paths may overlap. With this it is possible to compress the structure. Figure 7 shows an example of the generation of an FP-tree using the following 10 transactions. Related to the basket example above, an item like a, b, c or d could be an item of a basket, e.g. a product which was purchased in a supermarket.

TID | Items
1 | {a,b}
2 | {b,c,d}
3 | {a,c,d,e}
4 | {a,d,e}
5 | {a,b,c}
6 | {a,b,c,d}
7 | {a}
8 | {a,b,c}
9 | {a,b,d}
10 | {b,c,e}

Fig. 7. Construction of an FP-tree, shown (i) after reading TID = 1, (ii) after reading TID = 2, (iii) after reading TID = 3 and (iv) after reading TID = 10 - Based on [?]

The FP-Tree is generated in a simple way. First a transaction t is read from the database. The algorithm checks whether the prefix of t maps to a path in the FP-Tree. If this is the case, the support counts of the corresponding nodes in the tree are incremented. If there is no overlapping path, new nodes are created with a support count of 1. Figure 8 shows the corresponding UML (Unified Modeling Language) activity diagram.

Fig. 8. Activity diagram - Construction of the FP-Tree - Notation: [?]

Additionally, an FP-Tree uses pointers connecting nodes that contain the same item, creating a singly linked list per item. These pointers, represented as dashed lines in Figures 7, 9 and 10, are used to access individual items in the tree even faster. The FP-Tree is then used to extract frequent item sets directly from this structure. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path.
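A minimal Python sketch of this construction pass follows, assuming that per-item header lists stand in for the dashed node links of Figure 7 and that the min_sup filtering happens in Pass 1; the class and variable names are illustrative.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1                  # support count of the path through this node
        self.children = {}              # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item supports and discard infrequent items.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i for i, s in support.items() if s >= min_sup}
    root = FPNode(None, None)
    header = defaultdict(list)          # item -> list of nodes (the node links)
    # Pass 2: insert each transaction, items ordered by descending support.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item in node.children:           # prefix overlaps an existing path
                node = node.children[item]
                node.count += 1
            else:                               # otherwise create a node with count 1
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)      # extend the item's linked list
                node = child
    return root, header

transactions = [{"a","b"}, {"b","c","d"}, {"a","c","d","e"}, {"a","d","e"},
                {"a","b","c"}, {"a","b","c","d"}, {"a"}, {"a","b","c"},
                {"a","b","d"}, {"b","c","e"}]
root, header = build_fp_tree(transactions, min_sup=2)
print(len(header["e"]))   # 3: the tree contains three paths ending in e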

In the best case the tree contains only a single path of nodes, because all transactions have the same set of items. A worst case scenario would be a data source where every transaction has a unique set of items. Usually the FP-tree is smaller than the uncompressed data, because many transactions share items. As already mentioned, the algorithm has to scan the data source twice:

Pass 1: The data set is scanned to determine the support of each item. The infrequent items are discarded and not used in the FP-Tree. All frequent items are ordered based on their support.
Pass 2: The algorithm does a second pass over the data to construct the FP-tree.

The following example shows how the algorithm works. According to Figure 7 the first transaction is {a, b}.

1) Because the tree is empty, two nodes a and b with counter 1 are created, together with the path null → a → b.
2) After {b, c, d} is read, three new nodes b, c and d have to be created. The value for count is 1 and a new path null → b → c → d is created. Because the item b already occurred in transaction one, a new pointer is added between the b's (dashed lines).
3) The transaction {a, c, d, e} overlaps with transaction one because of the a in the first place. The frequency count for a is incremented by 1. Additional pointers between the c's and d's are added.

After every transaction has been scanned, the full FP-Tree has been created (Figure 7-iv). Now the FP-Growth algorithm uses the tree to extract frequent item sets.

b) Extract frequent item sets: A bottom-up strategy starts with the leaves and moves up to the root using a divide and conquer strategy. Because every transaction is mapped onto a path in the FP-Tree, it is possible to mine frequent item sets ending in a particular item, for example e or d. So, according to Figure 9, the algorithm first searches for frequent item sets ending with e and then with d, c, b and a, until the root is reached. Using the pointers, each of the paths can be accessed very efficiently by following the list. Furthermore each path of the tree can be processed recursively to extract the frequent item sets, so the problem can be divided into smaller subproblems. All solutions are merged at the end. This strategy allows the algorithm to be executed in parallel on multiple machines [?].

Fig. 9. Subproblems - Based on [?]

The FP-Growth algorithm finds all item sets ending with a specified suffix using the divide and conquer strategy. Assume the algorithm analyzes item sets ending with e. To do so, the item set e itself has to be frequent, which can be checked using the corresponding FP-Tree ending in e (Figure 10 (a)). If it is frequent, the algorithm has to solve the subproblems of finding frequent item sets ending in de, ce, be and ae. These subproblems are solved using the conditional FP-Tree (Figure 10 (b)). The following example (Figure 10) shows how the algorithm solves the subproblems for the task of finding frequent item sets ending with e [?]. Assume the minimum support is set to two.

1) The first step is to collect the initial prefix paths (Figure 10 a).
2) From these prefix paths the support count of e is calculated by adding up the support counts of all nodes labeled e. In the example the support count is 3.
3) Because 3 ≥ 2 = min_sup, the algorithm has to solve the subproblems of finding frequent item sets ending with de, be, ce and ae. To solve the subproblems, the prefix paths have to be converted into a conditional FP-Tree (Figure 10 b). This tree is used to find frequent item sets ending with a specific suffix.
   a) The support counts along the prefix paths are updated, because some counts include transactions that do not contain e.
Consider the rightmost path of the tree: null → b:2 → c:2
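A sketch of the prefix-path collection of step 1 and the count update of step 3a, reusing the build_fp_tree sketch from above; the printed output shape is illustrative.

def prefix_paths(item, header):
    """Collect all prefix paths ending in `item` (Figure 10 a), following the
    header node links. Returns (path, count) pairs, where count is the support
    of the item's node, i.e. the updated count of step 3a rather than the raw
    counts of the inner nodes."""
    paths = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        paths.append((list(reversed(path)), node.count))
    return paths

print(prefix_paths("e", header))
# [(['a', 'c', 'd'], 1), (['a', 'd'], 1), (['b', 'c'], 1)]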


More information

Statistical Learning Theory Meets Big Data

Statistical Learning Theory Meets Big Data Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician,

More information

Data Mining Applications in Manufacturing

Data Mining Applications in Manufacturing Data Mining Applications in Manufacturing Dr Jenny Harding Senior Lecturer Wolfson School of Mechanical & Manufacturing Engineering, Loughborough University Identification of Knowledge - Context Intelligent

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

Are You Ready for Big Data?

Are You Ready for Big Data? Are You Ready for Big Data? Jim Gallo National Director, Business Analytics February 11, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?

More information

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm Purpose: key concepts in mining frequent itemsets understand the Apriori algorithm run Apriori in Weka GUI and in programatic way 1 Theoretical

More information

CS 378 Big Data Programming. Lecture 2 Map- Reduce

CS 378 Big Data Programming. Lecture 2 Map- Reduce CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Big Data. Fast Forward. Putting data to productive use

Big Data. Fast Forward. Putting data to productive use Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information