Big Data Analysis Technology
Tobias Hardes
Seminar: Cloud Computing and Big Data Analysis, Summer semester 2013, University of Paderborn

Abstract
This paper introduces several Big Data processing techniques. Association rule mining finds relationships in large data sets; for this, mining frequent patterns is an important technique. The Apriori algorithm is a classic algorithm for detecting these frequent item sets and for generating rules from them. Compared to the Apriori algorithm, the FP-Growth algorithm needs fewer scans of the database to extract frequent item sets and performs the mining even faster. Next to rule mining, cluster analysis is another technique for analyzing big data sets. Here the K-Means algorithm is the most common one; it minimizes the Euclidean distance between entities in the same cluster. Furthermore the K-Means++ algorithm is discussed, which provides an extension of K-Means: with the preprocessing of K-Means++, K-Means converges much faster to a solution that is O(log(k)) competitive to the optimal K-Means solution. Finally Google's MapReduce is discussed, a parallel and distributed computing framework for processing Big Data on a cluster with hundreds or thousands of computers.

Keywords
Big Data, Cloud Computing, Association rule mining, Apriori algorithm, FP-Growth algorithm, frequent pattern, K-means, K-means++, MapReduce framework

I. INTRODUCTION

Since the Business Intelligence area became an important topic for the decision making and reporting of companies, things have changed. Today it is possible to get storage units for less than $600 that store all of the world's music [?]. The ability to store and work with data, and to use the results for further analysis, becomes even more accessible with trends such as Moore's Law [?]. Big Data is a term for data that has three main characteristics. First, it involves a great volume of data. Second, the data cannot be structured into regular database tables because of its variety. Third, the data is produced with great velocity and must be captured and processed rapidly [?]. Other literature adds one additional keyword, veracity [?]. Big Data analysis is needed in order to discover trends or hidden knowledge over a course of activities. In addition to transactional data, there is data from the web, e.g. from social networks or sensor networks. Taking a closer look at the changes in our environment during the last years, we should also mention radio-frequency identification (RFID) tags, geodata (GPS), data from satellites and medical applications, which all create a variety of data. Because of this high volume, it is not possible to insert this data into traditional databases, which already contain terabytes or petabytes of data. Furthermore, the use of sensors often leads to a high velocity of new data: new information is created every second and might have to be processed in real time. With this there is also a challenge of veracity, because there can be wrong values that have to be detected. The main purpose of this paper is to present different techniques for processing Big Data. The remainder of this paper is organized as follows. Section II discusses some issues based on the research at CERN and shows some problems that are related to Big Data. Section III discusses advanced and related techniques of Big Data.
Section IV discusses techniques from the area of Data Mining, presenting the most important algorithms and methods to analyze large data sets. Finally, Section V concludes the aspects described in this paper.

II. BACKGROUND

In July 2012 the Large Hadron Collider (LHC) experiments ATLAS and CMS announced that they had each observed a new particle in the mass region around 126 GeV (1 GeV is one billion electron volts; the electron volt is a unit of energy equal to approximately 1.6 x 10^-19 joule), also known as the Higgs boson [?]. The LHC is the world's largest and most powerful particle accelerator. To find the Higgs boson, particles collide within the LHC approximately 600 million times per second. Each collision generates particles that often decay in complex ways into even more particles. Sensors are used to record the passage of each particle through a detector as a series of electronic signals, and send the data to the CERN Data Centre for digital reconstruction. Physicists must sift through the approximately 15 petabytes of data produced annually to determine whether the collisions have thrown up any interesting physics [?]. The Worldwide LHC Computing Grid (WLCG) was founded in 2002 to address the issue of missing computational resources. The grid links thousands of computers in 140 centers over 45 countries. Up to now the WLCG is the largest computing grid in the world [?] and it runs more than one million jobs per day. At peak rates, 10 gigabytes of data may be transferred from its servers every second [?], [?]. Similar scenarios exist in the medical area, in governments or in the private sector (e.g. Amazon.com, Facebook). Dealing with large data sets becomes more and more important because an accurate data basis is needed to face the problems of these areas. The challenge is to perform complex analysis algorithms on Big Data to generate or to find new knowledge that the data contains.
This new knowledge can be used to address important, temporal and daily problems such as the research at CERN, the analysis of the stock market or the analysis of company data that was gathered over a long period of time.

A. Just in time analysis

Often it is necessary to analyze data just in time, because it has temporal significance (e.g. stock market or sensor networks). For this it might be necessary to analyze data for a certain time period, e.g. the last 30 seconds. Such requirements can be addressed through the use of Data Stream Management Systems (DSMS). The corresponding software tools are called Complex Event Processing (CEP) engines, and the queries are written in declarative languages, Event Processing Languages (EPL), such as the one of Esper [?]. The syntax is similar to SQL for databases.

Listing 1. Sample EPL command
select avg(price) from StockTickEvent.win:time(30 sec)

As shown in the example above, an EPL supports a window over a data stream, which can be used to buffer events during a defined time. This technique is used for data stream analysis. The window can move or slide in time; Figure 1 shows the difference between the two variants. If a window moves by as much as the window size, it is a tumbling window. The other type is named sliding window: this window slides in time and buffers the last x elements.

Fig. 1. Windows - Based on [?]

The size of the window can be set by the user within the select command. Note that smaller window sizes have less data to compare, so they often result in high false positive rates. If a larger window is used, this effect can be compensated [?].
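The windowing idea can also be sketched outside of a CEP engine. The following Python sketch is only an illustration (it is not Esper and it assumes that events arrive as (timestamp, price) pairs): it buffers the events of the last 30 seconds and reports the running average, roughly what the EPL statement of Listing 1 expresses with a sliding window.

    from collections import deque

    class SlidingTimeWindow:
        """Buffers the (timestamp, value) events of the last size_sec seconds."""

        def __init__(self, size_sec=30.0):
            self.size_sec = size_sec
            self.events = deque()

        def insert(self, timestamp, value):
            self.events.append((timestamp, value))
            # Evict all events that have left the window.
            while self.events and self.events[0][0] <= timestamp - self.size_sec:
                self.events.popleft()

        def average(self):
            if not self.events:
                return None
            return sum(v for _, v in self.events) / len(self.events)

    window = SlidingTimeWindow(30.0)
    for ts, price in [(0, 10.0), (10, 12.0), (25, 11.0), (45, 9.0)]:
        window.insert(ts, price)
        print(ts, window.average())  # at ts = 45 the events from ts = 0 and ts = 10 are already evicted

A tumbling window would instead empty the buffer completely every 30 seconds.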
When using a CEP engine, the computation is done just in time. Because of this, the complexity of the analysis cannot be very high, so it might not be possible to detect the Higgs boson using just data stream analysis, because that analysis is too complex. Therefore, further mining algorithms on the full data set are required. Taking a look at the Twitter service, it might be possible to use CEP to analyze the stream of tweets to detect emotional states or something similar. This example shows some additional aspects of Big Data analysis with respect to fault tolerance. Consider the following tweet: "After a whole 5 hours away from work, I get to go back again, I'm so lucky!" This tweet contains sarcasm, and it is a complex task to detect this. A way to solve this problem is to collect training data and to apply this data to an appropriate learning algorithm [?].

B. Rule mining and clustering

A more advanced technique is called association rule mining. Association rule mining searches for relationships between items in data sets, but it can also be implemented for analyzing streams. For this, there are association rules that are used by algorithms to find associations, correlations or causal structures. Chapters IV-A2 and IV-A3 discuss this technique in more detail. Cluster analysis is the task of classifying similar objects into groups or classes. The classes are defined during the clustering; it is not a mapping to already pre-defined classes [?]. Today, cluster analysis techniques are used in different areas, such as data compression, process monitoring, DNA analysis, finance, radar scanning and further research areas. In all of these areas a huge amount of data is stored. A clustering algorithm can be hierarchical or partitional [?].

III. RELATED WORK

Big Data became a very important topic during the last years. It is still an important area for research, but it has also reached business. Various companies are working in the field of Big Data, be it in research or in consulting: McKinsey & Company [?], IBM Corporation [?], Fujitsu Limited [?], Intel Corporation [?] and many more. This chapter discusses some further developments to improve the algorithms and methods that are usually used to process Big Data. In addition, alternative solutions are presented that can be used for analyzing or processing Big Data.

A. Basic principles

There are different techniques that allow the analysis of large data sets. Not all of these techniques strictly require the use of Big Data, but they can be applied to this area. Most algorithms for large item sets are related to the Apriori algorithm that is discussed in Chapter IV-A2. All algorithms and methods are usually based on the same basic principles: statistics, probability theory and machine learning in terms of a system that can learn from data. The area called Data Mining is the biggest one. Data Mining can be described as a set of techniques that can be used to extract patterns from small as well as big data sets [?]. In the field of Data Mining the most important techniques are association rule learning, cluster analysis and classification.
B. Improvements

Today the performance of such algorithms becomes more and more important, since analyzing large data sets results in high computational costs. In recent years much progress has been made in the following directions [?]:
- Selection of the start parameters of the algorithms
- Reducing the number of passes over the database
- Sampling the database
- Adding extra constraints on the structure of patterns
- Parallelization

Chapter IV-B1 discusses the K-means method, which is used for clustering. The classic implementation has an exponential worst-case runtime of O(2^Omega(n)) [?]. In [?], Arthur et al. published an alternative implementation that leads to a lower runtime by improving the start parameters of the algorithm. Other implementations pursue an optimization approach to reduce the runtime. Many variations of this kind have also been made to the Apriori algorithm; they are discussed in [?]. Many of these variations are concerned with the generation of a temporary data set, because this part has the greatest computational cost. In [?], Yang Kai et al. discussed a new algorithm called FA-DMFI that is used for discovering the maximum frequent item sets. For this, the algorithm stores association information by scanning the database only once; the temporary data set is stored in an association matrix. Summarized, the efficiency is achieved in two steps:
1) The huge database is compressed into a smaller data structure, which avoids repeated and costly database scans.
2) The maximum frequent item set is ultimately generated by means of the cover relation in this information matrix, and the costly generation of a large number of candidates is avoided.
With these steps the I/O time and the CPU time are reduced, and the generation of the candidate item sets and the frequent item sets can be done in less time. The Apriori algorithm, which is discussed in Chapter IV-A2, spends a lot of time generating the frequent item sets and needs to scan the data source multiple times. Using some implementation details of the FA-DMFI algorithm could also improve the runtime of the Apriori algorithm.

C. Distributed computing

Often the computational tasks are very complex and the data that is needed for an individual task is several terabytes or even more. That is why it is not possible to compute these tasks on a single machine: there are not enough hardware resources. These are the reasons why the parallelization of computational tasks becomes more and more important. In [?], Wangl et al. present a development platform based on virtual clusters. Virtual clusters are a kind of virtual machines provided by cloud computing. The platform presented in [?] provides techniques for parallel data mining based on these virtual clusters. To measure the performance of the system, Wangl et al. used a PSO (Particle Swarm Optimization) algorithm that was parallelized. The platform reduces the computing time of the swarm and improves the cluster quality. To use the advantages of a parallel architecture, a correspondingly designed software architecture is required.

D. Big Data and Cloud Computing

MapReduce is a distributed computing technology that was developed by Google. The programming model allows implementing custom mapper and reducer functions programmatically and running batch processes on hundreds or thousands of computers. Chapter IV-C discusses this method in more detail. Based on this technology, Google developed a web service called Google BigQuery that can be used for analyzing large data sets. BigQuery is based on Dremel, a query service that allows running SQL-like queries against very large data sets and getting accurate results in seconds [?]. So the data set can be accessed and queried, but BigQuery is not meant to execute complex mining. Usually the queries are executed with Microsoft Office or the Google Chrome browser. Figure 2 gives an example of the BigQuery workflow.

Fig. 2. Google BigQuery workflow: upload the data set to Google Storage, import the data into tables, run queries

With this approach Google tries to offer a suitable system for OLAP (Online Analytical Processing) or BI (Business Intelligence) using Big Data. Here, most queries are quite simple and done with simple aggregations or filtering on columns. Because of this, BigQuery is not a programming model for deep analysis of Big Data. It is a query service with functionality to perform a quick analysis, like the aggregation of data, and there is no possibility to implement user code [?].
IV. MAIN APPROACHES

This section analyses and describes the implementation details and practical issues of Big Data analysis techniques. The focus is on algorithms for association rule mining and cluster analysis; the last part discusses a technique for the distributed and parallel analysis of large data sets. Subsection IV-A2 discusses the Apriori algorithm, Subsection IV-A3 the Frequent Pattern (FP)-Growth algorithm and Subsection IV-B1 the K-means and K-means++ algorithms. Finally, Subsection IV-C discusses the MapReduce framework.

A. Association rule mining

Association rule mining can be used to identify items that are related to other items. A concrete daily life example could be the analysis of baskets in an online shop or even in a supermarket. Because each customer purchases different combinations of products at different times, there are questions that can be addressed with association rule mining:
- Why do customers buy certain products?
- Who are the customers? (Students, families, ...)
- Which products might benefit from advertising?
The calculated rules can be used to restructure the market or to send special promotions to certain types of customers. This chapter discusses the Apriori and the FP-Growth algorithm. Both algorithms often provide the basis for further specialized algorithms. A common strategy used by association rule mining algorithms is to split the problem into two tasks:
1) Generation of frequent item sets: The task is to find all item sets that satisfy a minimum support, which is usually defined by the user.
2) Generation of rules: The task is to find high-confidence rules using the frequent item sets.
The Apriori algorithm (see Chapter IV-A2) and the FP-Growth algorithm (see Chapter IV-A3) are both based on this principle.

1) Terminology: The following terminology has been taken from [?], [?], [?]. These definitions were adapted in order to discuss the algorithms of this chapter more accurately. Let S be a stream or a database with n elements. Then I = {i_1, i_2, ..., i_n} is an item set with n items. So an item set could be a basket in a supermarket containing n elements. Furthermore, an association rule is of the form A => B, where A and B are subsets of I and A and B are disjoint. So rule mining can extract a trend such as "when stocks of credit institutes go down, companies for banking equipment follow". There are no restrictions on the number of items that are used in such a rule [?]. The support of an item set is the percentage of records that contain the specific item set. Assume that phi is the frequency of occurrence of an item set. Then the support of an item set A is:

sup(A) = phi(A) / |S|

The confidence is calculated as the relative percentage of records that contain both A and B. It determines how frequently items in B appear in transactions that contain A:

conf(A => B) = Support(A u B) / Support(A)

Additionally, the frequency of an item set is defined as the number of transactions in S that contain the item set, and L_k denotes the frequent item sets of length k [?]. The worked example in Chapter IV-A2 (Table I) shows a daily situation where association rule mining is used.
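To make these definitions concrete, here is a minimal Python sketch that computes the support of an item set and the confidence of a rule A => B directly from a list of transactions; as input it uses the five transactions of Table I from the worked example in Chapter IV-A2.

    def support(itemset, transactions):
        """sup(A) = phi(A) / |S|: fraction of transactions containing every item of A."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """conf(A => B) = sup(A u B) / sup(A)."""
        return support(antecedent | consequent, transactions) / support(antecedent, transactions)

    transactions = [
        {"Coffee", "Pasta", "Milk"},
        {"Pasta", "Milk"},
        {"Bread", "Butter"},
        {"Coffee", "Milk", "Butter"},
        {"Milk", "Bread", "Butter"},
    ]
    print(support({"Bread", "Butter"}, transactions))        # 0.4  (40%)
    print(confidence({"Bread"}, {"Butter"}, transactions))   # 1.0  (100%)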
2) Apriori algorithm: This section discusses the Apriori algorithm, which is one of the most common algorithms for association rule mining. The Apriori algorithm uses prior knowledge to mine frequent item sets in a database or stream. It is a layer-by-layer iterative search algorithm, where the item sets of size k are used to analyze the item sets of size k + 1. The algorithm is based on a simple principle: if an item set is frequent, then all subsets of this item set are also frequent. To use the algorithm, the user specifies the minimum support (min_sup), the data source (S) and the minimum confidence (min_conf) (see Chapter IV-A1 for the terminology). The first task is to extract the frequent item sets.

Extract frequent item sets: Before the iteration starts, the algorithm scans the database to find the number of occurrences of each item; an item has to satisfy the min_sup input. Figure 4 shows the pseudocode for the implementation of the Apriori algorithm, and Figure 3 shows the corresponding UML (Unified Modeling Language) activity diagram.

Fig. 3. Activity diagram - Apriori frequent item set generation - Notation: [?]

1:  procedure APRIORI-FREQUENTITEMSETS(min_sup, S)
2:    L_1 <- {frequent 1-item sets}
3:    for k = 2; L_{k-1} not empty; k++ do
4:      C_k = apriorigen(L_{k-1})         -- create the candidates
5:      for each c in C_k do
6:        c.count <- 0
7:      end for
8:      for each I in S do
9:        C_r <- subset(C_k, I)           -- identify candidates that belong to I
10:       for each c in C_r do
11:         c.count++                     -- counting the support values
12:       end for
13:     end for
14:     if c.count >= min_sup then
15:       L_k = L_k u {c}
16:     end if
17:   end for
18:   return L_k
19: end procedure

Fig. 4. Apriori algorithm - Based on [?], [?]

The algorithm outputs the set of item sets I with support(I) >= min_sup. The function apriorigen in line 4 is the part where the set of candidates is created. It takes L_{k-1} as an argument and is performed in two steps:
1) Join: generation of a new candidate item set C_k.
2) Prune: elimination of item sets with support(I_j) < min_sup.
For this it is assumed that the items are ordered, e.g. alphabetically. Assume that L_{k-1} = {{a_1, a_2, ..., a_{k-1}}, {b_1, b_2, ..., b_{k-1}}, ..., {n_1, n_2, ..., n_{k-1}}} consists of distinct ordered sets with a_i < b_i < ... < n_i for 1 <= i <= k-1. Then the join step joins all (k-1)-item sets that differ by only the last element. The following SQL statement shows a way to select those items:

select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

The result of this statement forms the candidate set C_k. Lines 8 to 13 (Figure 4) are then used for the support counting: each transaction I is compared with the candidates in C_k (line 9) and the support counts are incremented. This is a very expensive task, especially if the number of transactions and candidates is very large. Finally, the prune step generates L_k from the candidates in C_k by removing those combinations of items that cannot satisfy the min_sup value (line 14). After the for-loop in line 3 terminates, the algorithm has output the item sets whose support is at least min_sup [?], [?], [?]. With this it is possible to generate association rules.
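The level-wise search of Figure 4 can be sketched in plain Python as follows. This is only an in-memory illustration of the principle, not an optimized implementation; min_sup is given here as an absolute count and the join and prune steps of apriorigen are combined in the candidate-generation loop.

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, min_sup):
        """Level-wise search: L_k is built from candidates generated out of L_{k-1}."""
        transactions = [frozenset(t) for t in transactions]
        # L_1: frequent 1-item sets.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        current = set(frequent)
        k = 2
        while current:
            # apriorigen: join (k-1)-item sets and prune candidates with an infrequent subset.
            candidates = set()
            for a in current:
                for b in current:
                    union = a | b
                    if len(union) == k and all(frozenset(s) in current for s in combinations(union, k - 1)):
                        candidates.add(union)
            # Support counting: compare every transaction with every candidate (the expensive part).
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {c for c, n in counts.items() if n >= min_sup}
            frequent.update({c: counts[c] for c in current})
            k += 1
        return frequent

    # With the transactions of Table I and min_sup = 2 this reports, among others,
    # {Milk}: 4, {Butter}: 3, {Bread, Butter}: 2 and {Milk, Butter}: 2.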
Generation of association rules: The generation of the association rules is similar to the generation of the frequent item sets and is based on the same principle. Figures 5 and 6 show the pseudocode for this part of the Apriori algorithm.

1: procedure APRIORI-RULEGEN(min_conf, S)
2:   for each frequent item set L_k do
3:     H_1 = {i | i in L_k}                -- rules with a 1-item consequent
4:     GENRULES(L_k, H_1, min_conf)
5:   end for
6: end procedure

Fig. 5. Apriori rule generation - Based on [?]

1:  procedure GENRULES(L_k, H_m, min_conf)
2:    if k > m + 1 then
3:      H_{m+1} = apriorigen(H_m)
4:      for each h_{m+1} in H_{m+1} do
5:        conf = Support(L_k) / Support(L_k \ h_{m+1})
6:        if conf >= min_conf then
7:          output the rule (L_k \ h_{m+1}) => h_{m+1}
8:        else
9:          H_{m+1} <- H_{m+1} \ {h_{m+1}}
10:       end if
11:     end for
12:     GENRULES(L_k, H_{m+1}, min_conf)
13:   end if
14: end procedure

Fig. 6. Apriori Genrules - Based on [?]

First, all rules that satisfy the minimum confidence are extracted using the frequent item sets. These rules are used to generate new candidate rules. For example, if the rules {a, b, c} => {d} and {a, f, c} => {g} both satisfy the minimum confidence, then there will be a candidate rule {a, c} => {d, g}; this is done by merging the consequents of both rules. The new rule has to satisfy the min_conf value, which was specified by the user. The generation of such rules is done iteratively. The function apriorigen (Figure 6, line 3) is used again to generate candidates for new rules. Then the set of candidates is checked against the min_conf value (Figure 6, line 6). For this, the same kind of pruning as in the generation of the frequent item sets is used: if a rule does not satisfy the minimum confidence, then no rule with a larger consequent built from the same frequent item set will satisfy it either. Because of this the rule can be removed (Figure 6, line 9).
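The rule generation of Figures 5 and 6 can likewise be sketched in Python. For brevity this sketch simply enumerates every non-empty proper subset of a frequent item set as a possible antecedent instead of growing the consequents level by level; frequent is assumed to be the dictionary of frequent item sets with their absolute support counts produced by the sketch above.

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        """Return the rules (antecedent, consequent, confidence) that satisfy min_conf."""
        rules = []
        for itemset, count in frequent.items():
            if len(itemset) < 2:
                continue
            for size in range(1, len(itemset)):
                for antecedent in map(frozenset, combinations(itemset, size)):
                    consequent = itemset - antecedent
                    conf = count / frequent[antecedent]   # sup(A u B) / sup(A)
                    if conf >= min_conf:
                        rules.append((set(antecedent), set(consequent), conf))
        return rules

    # With the frequent item sets of Table I (min_sup = 2) and min_conf = 0.8 this yields,
    # among others, the rule {Bread} => {Butter} with confidence 1.0.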
Example - Association rule mining: Usually the algorithms for association rule mining are used for basket analysis in online stores. To give an example, assume the following transactions:

TABLE I. TRANSACTIONS
T1: {Coffee, Pasta, Milk}
T2: {Pasta, Milk}
T3: {Bread, Butter}
T4: {Coffee, Milk, Butter}
T5: {Milk, Bread, Butter}

Assume the rule Bread => Butter. To compute the confidence for this rule, the support of the item set is needed:

sup(Bread => Butter) = phi({Bread, Butter}) / |S| = 2/5 = 40%

Then the confidence for this rule is computed as follows:

conf(Bread => Butter) = Support(Bread u Butter) / Support(Bread) = 40% / 40% = 100%

So the rule Bread => Butter has a confidence of 100%: if there is bread in a basket, the probability that the same basket also contains butter is 100%. Now assume the rule Butter => Milk. Again the support value has to be calculated:

sup(Butter => Milk) = phi({Butter, Milk}) / |S| = 2/5 = 40%

The calculation of the confidence is done as above:

conf(Butter => Milk) = Support(Butter u Milk) / Support(Butter) = 40% / 60% = approx. 66%

So the rule Butter => Milk has a confidence of about 66%: if there is butter in a basket, the probability that the same basket also contains milk is about 66%. Using such rules to identify trends can be critical, because a trend might not hold for long. If the streaming data outputs many rules, the typical reaction of the data analysts is to increase the support and confidence values to obtain fewer rules with stronger confidence, but those rules may already have become outdated by the end of that period. An example for this could be sales of ice cream during the hottest month of the summer: after the trend is gone, there is no sales opportunity left. Another example could be the stock market, which was already mentioned in Chapter II.

3) Frequent Pattern (FP)-Growth algorithm: The Frequent Pattern (FP)-Growth method is used with databases and not with streams. The Apriori algorithm needs n + 1 scans if a database is used, where n is the length of the longest pattern. By using the FP-Growth method, the number of scans of the entire database can be reduced to two. The algorithm extracts frequent item sets that can then be used to extract association rules, again based on the support of an item set. The terminology used for this algorithm is described in Chapter IV-A1. The main idea of the algorithm is to use a divide and conquer strategy: compress the database that provides the frequent sets, then divide this compressed database into a set of conditional databases, each associated with a frequent set, and apply the mining on each such database [?]. To compress the data source, a special data structure called the FP-Tree is needed [?]; the tree is used for the data mining part. The algorithm works in two steps:
1) Construction of the FP-Tree
2) Extraction of the frequent item sets

a) Construction of the FP-Tree: The FP-Tree is a compressed representation of the input. While reading the data source, each transaction t is mapped to a path in the FP-Tree. As different transactions can have several items in common, their paths may overlap, and with this it is possible to compress the structure. Figure 7 shows an example for the generation of an FP-tree using the following 10 transactions. Related to the example above (Table I), an item like a, b, c or d could be an item of a basket, e.g. a product that was purchased in a supermarket.

TID 1: {a, b}
TID 2: {b, c, d}
TID 3: {a, c, d, e}
TID 4: {a, d, e}
TID 5: {a, b, c}
TID 6: {a, b, c, d}
TID 7: {a}
TID 8: {a, b, c}
TID 9: {a, b, d}
TID 10: {b, c, e}

Fig. 7. Construction of an FP-tree; subfigures (i)-(iv) show the tree after reading TID = 1, 2, 3 and 10 - Based on [?]

The FP-Tree is generated in a simple way. First a transaction t is read from the database. The algorithm checks whether a prefix of t maps to a path that already exists in the FP-Tree. If this is the case, the support counts of the corresponding nodes in the tree are incremented. For the items without an overlapping path, new nodes are created with a support count of 1. Figure 8 shows the corresponding UML (Unified Modeling Language) activity diagram.

Fig. 8. Activity diagram - Construction of the FP-Tree - Notation: [?]

Additionally, an FP-Tree uses pointers connecting nodes that carry the same item, creating a singly linked list per item. These pointers, represented as dashed lines in Figures 7, 9 and 10, are used to access the individual items in the tree even faster. The FP-Tree is then used to extract frequent item sets directly from this structure. Each node in the tree contains the label of an item along with a counter that shows the number of transactions mapped onto the given path. In the best case the tree consists of only a single path, because all transactions have the same set of items. The worst case would be a data source where every transaction has a unique set of items. Usually the FP-tree is smaller than the uncompressed data source, because many transactions share items. As already mentioned, the algorithm has to scan the data source twice:
Pass 1: The data set is scanned to determine the support of each item. The infrequent items are discarded and not used in the FP-Tree. All frequent items are ordered based on their support.
Pass 2: The algorithm does the second pass over the data to construct the FP-tree.
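Under the same assumptions as before (transactions as sets, min_sup as an absolute count), the two passes can be sketched in Python with a small node class. The sketch only builds the tree and the per-item node links (the dashed pointers); the mining step described next is omitted.

    class FPNode:
        def __init__(self, item, parent):
            self.item = item            # item label, None for the root
            self.count = 1
            self.parent = parent
            self.children = {}          # item -> FPNode
            self.link = None            # next node carrying the same item (dashed pointer)

    def build_fp_tree(transactions, min_sup):
        # Pass 1: determine the support of each item and discard the infrequent ones.
        support = {}
        for t in transactions:
            for item in t:
                support[item] = support.get(item, 0) + 1
        frequent = {i for i, c in support.items() if c >= min_sup}

        # Pass 2: insert every transaction, its items ordered by descending support.
        root = FPNode(None, None)
        header = {}                     # item -> first node of its linked list
        for t in transactions:
            items = sorted((i for i in t if i in frequent), key=lambda i: (-support[i], i))
            node = root
            for item in items:
                if item in node.children:          # prefix already in the tree: increment the count
                    node = node.children[item]
                    node.count += 1
                else:                              # otherwise create a new node with count 1
                    child = FPNode(item, node)
                    node.children[item] = child
                    child.link = header.get(item)  # maintain the node links for this item
                    header[item] = child
                    node = child
        return root, header

    # Usage, e.g. with the ten transactions of Figure 7:
    # root, header = build_fp_tree([{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, ...], min_sup=2)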
The following example shows how the construction works. According to Figure 7, the first transaction is {a, b}.
1) Because the tree is empty, two nodes a and b with counter 1 are created and the path null -> a -> b is created.
2) After {b, c, d} is read, three new nodes b, c and d have to be created, each with count 1, and a new path null -> b -> c -> d is created. Because the item b already occurred in transaction one, a new pointer between the two b nodes is added (dashed lines).
3) The transaction {a, c, d, e} overlaps with transaction one because of the a in the first place. The frequency count for a is incremented by 1, and additional pointers between the c and the d nodes are added.
After each transaction has been scanned, the full FP-Tree has been created (Figure 7-iv). Now the FP-Growth algorithm uses the tree to extract frequent item sets.

b) Extract frequent item sets: A bottom-up strategy starts with the leaves and moves up to the root using a divide and conquer strategy. Because every transaction is mapped onto a path in the FP-Tree, it is possible to mine frequent item sets ending in a particular item, for example e or d. So according to Figure 9, the algorithm first searches for frequent item sets ending with e, and then with d, c, b and a, until the root is reached. Using the pointers, the paths can be accessed very efficiently by following the linked lists. Furthermore, each path of the tree can be processed recursively to extract the frequent item sets, so the problem can be divided into smaller subproblems whose solutions are merged at the end. This strategy allows executing the algorithm in parallel on multiple machines [?].

Fig. 9. Subproblems - Based on [?]

The FP-Growth algorithm finds all item sets ending with a specified suffix using the divide and conquer strategy. Assume the algorithm analyzes item sets ending with e. To do so, first the item set {e} has to be frequent. This can be checked using the corresponding prefix-path tree ending in e (Figure 10 (a)). If it is frequent, the algorithm has to solve the subproblems of finding frequent item sets ending in de, ce, be and ae. These subproblems are solved using the conditional FP-Tree (Figure 10 (b)). The following example (Figure 10) shows how the algorithm solves the subproblems for the task of finding frequent item sets ending with e [?]. Assume the minimum support is set to two.
1) The first step is to collect the initial prefix paths ending in e (Figure 10 a).
2) From these prefix paths the support count of e is calculated by adding up the support counts of all e nodes. In the example the support count is 3.
3) Because 3 >= 2 = min_sup, the algorithm has to solve the subproblems of finding frequent item sets ending with de, ce, be and ae. To solve the subproblems, the prefix paths have to be converted into a conditional FP-Tree (Figure 10 b), which is used to find frequent item sets ending with a specific suffix.
  a) Update the support counts along the prefix paths, because some counts include transactions that do not contain e. Consider the path at the far right of the tree, null -> b:2 -> c:2 -> e:1. This path includes the transaction {b, c}, which does not contain the item e. Because of this, the counts along this prefix path have to be set to 1.
  b) The node e is removed, because the support counts have been updated to reflect only transactions that contain e, and the subproblems of finding item sets ending in de, ce, be and ae no longer need information about the node e.
  c) Because the support counts were updated, there might be some prefix paths that are no longer frequent. According to Figure 10 (a), the node b appears only once with support equal to 1. It follows that there is only one transaction that contains both b and e, so be is not frequent and can be ignored.
4) The tree of Figure 10 b is used to solve the subproblems of finding frequent item sets ending with de, ce and ae. Consider the subproblem of finding frequent item sets ending with de. A new prefix-path tree is needed (Figure 10 c). After the frequency counts for d have been updated, the support count for {d, e} is equal to 2, which satisfies the defined conditions. Next, the conditional FP-tree for de is constructed using the same method as in step 3. Figure 10 d shows the conditional FP-tree for de. This tree contains only the single item a; its support is 2, which also satisfies the conditions. The algorithm extracts the item set {a, d, e} and this subproblem is completely processed.
5) The algorithm continues with the next subproblem.

Fig. 10. Example of the FP-Growth algorithm - Based on [?]

This example demonstrates that the runtime depends on the compression of the data set. The FP-Growth algorithm has some advantages compared to the Apriori algorithm:
- Compressed data structure.
- No expensive computation of candidates.
- Use of a divide and conquer strategy.
- Scanning the data source only twice.
It can therefore be assumed that the FP-Growth algorithm is more efficient. The next chapter, IV-A4, compares the Apriori algorithm and the FP-Growth algorithm.

4) Efficiency: To compare both algorithms, two databases with 10,000 and 100,000 records are used. Figure 11 shows the total runtime: the FP-Growth method is faster than the Apriori algorithm because it scans the database only twice.

Fig. 11. Total runtime [?]

Fig. 12. Memory usage [?]

From Figure 12 we can see that the FP-Growth algorithm also uses less memory than the Apriori algorithm, because the FP method compresses the data structure [?].

B. Clustering

Cluster analysis is the task of classifying similar objects into groups or classes; the classes are defined during the clustering. An area of use could be marketing, where clustering can discover groups among all customers that can then be used for targeted marketing. Cluster analysis is a very important area in data mining. This chapter discusses the K-Means algorithm and a variation called the K-Means++ algorithm. Although the first version of the K-Means algorithm was published in 1964, it is still one of the most important methods of cluster analysis. It is the basis for further algorithms that often improve the accuracy or the runtime.
1) K-means algorithm: Clustering is the task of dividing data into groups of similar objects. This task can be solved in various ways and there is no single specific algorithm to solve it. There are two main methods of clustering:
1) Hierarchical: One cluster is built at the beginning. Iteratively, points are added to existing clusters or a new cluster is created.
2) Centroid based: A cluster is represented by a central point, which need not be a part of the data set.
The K-means algorithm is one of the most common techniques used for clustering. It is a kind of learning algorithm and it is centroid based. The goal of the K-Means algorithm is to find the best division of an arbitrary data set into k classes or clusters, such that the distance between the members of a cluster and its corresponding centroid is minimized. To calculate the distance it is possible to use various metrics. In the standard implementation of the K-means method, a partition of a data set with n entities into k sets is sought that minimizes the within-cluster sum of squares (WCSS):

WCSS = sum_{j=1..k} sum_{i=1..n} || x_i^(j) - c_j ||^2

The expression || x_i^(j) - c_j || describes the Euclidean distance between an entity and the cluster's centroid. This implementation is also known as Lloyd's algorithm. It provides a solution that can be trapped in a local minimum, and there is no guarantee that it corresponds to the best possible solution. The accuracy and running time of K-means clustering depend heavily on the positions of the initial cluster centers [?]. The initial K-means implementation requires three parameters: the number of clusters k, a distance metric and the cluster initialization. There are different implementations of the K-means method. The classic K-means clustering algorithm just takes the parameter k; it uses the Euclidean distance and the cluster initialization is done with the first items of the input. The algorithm creates k clusters by assigning the data to its closest cluster mean using the distance metric. Using all data of a cluster, the new mean is calculated. Based on the new means, the data has to be reassigned to the means. This loop goes on until the maximum number of iterations is reached or the newly calculated means do not move any more. In fact it is a heuristic algorithm. The basic K-means algorithm is as follows [?]:
1) Select k entities as the initial centroids.
2) (Re)Assign all entities to their closest centroids.
3) Recompute the centroid of each newly assembled cluster.
4) Repeat steps 2 and 3 until the centroids do not change or until the maximum number of iterations is reached.
Figure 13 shows an exemplary result of a K-means execution; the purple x marks represent the centroids of the clusters.

Fig. 13. Exemplary K-means result
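A compact Python sketch of these four steps is given below (plain Lloyd iteration with random initial centroids and the Euclidean distance; the data points are made up for illustration).

    import math
    import random

    def kmeans(points, k, max_iter=100):
        """Basic K-means: assign each point to the nearest centroid, then recompute the centroids."""
        centroids = random.sample(points, k)                    # step 1: select k initial centroids
        clusters = [[] for _ in range(k)]
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:                                    # step 2: assign to the closest centroid
                j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[j].append(p)
            new_centroids = [                                   # step 3: recompute each centroid
                tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
                for i, cluster in enumerate(clusters)
            ]
            if new_centroids == centroids:                      # step 4: stop when the centroids do not move
                break
            centroids = new_centroids
        return centroids, clusters

    points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
    centroids, clusters = kmeans(points, k=2)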
2) K-means++ algorithm: The K-means algorithm converges to an optimum without any guarantee that it is the global one. There may often be a large number of local optima, depending on the total number of entities n, the number of clusters k and the original layout of the entities. Because of this, the choice of the starting values for the algorithm is critical, since different starting parameters can lead to different optima [?], [?]. According to Aloise et al. [?], the K-means clustering problem is NP-hard in Euclidean space even for k = 2, and the algorithm has a worst-case runtime of O(2^Omega(n)). The K-means algorithm is the starting point for many further algorithms. Because the basic decision version of the K-means problem is NP-hard, there are additional implementations that are slightly modified [?], [?]. A common way to handle NP-hard problems is to compute an approximate solution. Based on K-means, Arthur and Vassilvitskii described the serial K-means++ method [?] as an approximation algorithm for K-means that improves the theoretical runtime. The K-means++ method selects the first centroid at random; each subsequent centroid is selected randomly, weighted by its distance to the centers already chosen. Let D(x) denote the shortest distance from an entity x of the data source X to the closest center that has already been chosen. Then K-means++ works as follows:
1) Take one center c_1, chosen uniformly at random from X.
2) Take a new center c_i, choosing x in X with probability D(x)^2 / sum_{x' in X} D(x')^2.
3) Repeat step 2 until k centers have been taken.
4) Proceed as with the standard K-means algorithm.
Step 2 is also called D^2 weighting. Although the K-means++ method needs more time for steps 1-3, step 4 converges much faster than the K-means method without this initial selection. The main idea is to choose the centers in a controlled fashion, where the current set of chosen centers stochastically influences the choice of the next center [?]. According to Arthur et al. [?], K-means++ is O(log(k)) competitive to the optimal solution, where k is the number of clusters.
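The D^2 weighting of steps 1-3 can be added as a small seeding function; its result would then replace the random initialization of a standard K-means implementation such as the kmeans sketch above (step 4). The use of random.choices for the weighted draw is just one possible way to implement the selection.

    import math
    import random

    def kmeans_pp_init(points, k):
        """K-means++ seeding: pick centers with probability proportional to D(x)^2."""
        centers = [random.choice(points)]                       # step 1: first center uniformly at random
        while len(centers) < k:
            # D(x): distance from x to the closest center chosen so far.
            d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
            # Step 2: draw the next center with probability D(x)^2 / sum of all D(x')^2.
            centers.append(random.choices(points, weights=d2, k=1)[0])
        return centers                                          # step 3 done; step 4: run K-means with these centers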
C. Distributed computation - MapReduce algorithm

Chapter II discussed the research at CERN: because of the huge amount of data, CERN founded the Worldwide LHC Computing Grid (WLCG) to perform the analysis of that large data set. Google's MapReduce framework can be used for generating and processing large data sets in a similar fashion. It allows the development of scalable parallel applications that process Big Data on a cluster or a grid [?]. The data can be stored in a structured (database) or unstructured way (filesystem). A typical MapReduce computation processes many terabytes of data on thousands of machines [?].

1) Programming model: The programmer expresses the computation in the two functions map and reduce. The MapReduce framework contains another function called shuffle. This function has no user code; it is a generic step that is executed after the user's map function. The map and reduce functions are written by the user. The function map takes an input pair and produces a set of intermediate key/value pairs. This output is used by the MapReduce framework to group the values by the intermediate keys, which is done by the shuffle function. The aggregated values are then passed to the reduce function. Because it is arbitrary user code, the reduce function outputs whatever data the user specifies; usually it uses the given values to form a possibly smaller set of values, and typically there is just one result value per key at the end [?].

2) Execution overview: Figure 14 shows the complete flow when user code calls the MapReduce framework.

Fig. 14. MapReduce computational model: the user program forks a master and worker processes; the master assigns map and reduce tasks; map workers read the input files and write intermediate files to local disk; reduce workers read them via remote procedure calls, shuffle and reduce the data, and write the output files - Based on [?], [?]

When the user calls the MapReduce function in the user code, the following sequence of actions occurs [?], [?]:
1) The MapReduce framework splits the specified set of input files I into m pieces I_j, j = 1, ..., m. Then it starts up copies of the user's program on the various machines of a cluster, plus an additional copy called the Master.
2) All copies excluding the Master are workers that are assigned work by the Master. The numbers of workers in the Map phase and in the Reduce phase do not have to be equal, so assume there are M map tasks and R reduce tasks. The Master picks idle workers and assigns a Map or Reduce task to each of them.
3) A worker with a Map task reads the content of the corresponding input file I_j. The intermediate key/value pairs produced by the map function are buffered in memory.
4) Periodically, the buffered pairs are written to local disk, partitioned into R regions. The storage locations of these partitions are passed back to the Master, which is responsible for forwarding these locations to the reduce workers.
5) A Reduce worker uses remote procedure calls to read the buffered data from the local disks of the map workers. After the Reduce worker has read all intermediate data for its partition, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together.
6) The Reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's reduce function. The result of the Reduce worker is an output file that is written to an arbitrary destination.
7) After all Map and Reduce tasks have been completed, the Master notifies the user program, and the MapReduce call in the user code returns.

3) Example: Consider the problem of counting words in a single text or on a website. For this problem, the map function is very simple:

1: procedure MAP(key, value)
2:   for each word in value.split() do
3:     output.collect(word, 1)
4:   end for
5: end procedure

Fig. 15. Map implementation

The map function processes one line at a time. It splits the line and emits a key/value pair <word, 1>. Using the line "Hello World Bye World", the following output is produced:
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
The reduce implementation just sums up the values:

1: procedure REDUCE(key, values)
2:   while values.hasNext() do
3:     sum <- sum + values.next().get()    -- sum up the values
4:   end while
5:   return (key, sum)
6: end procedure

Fig. 16. Reduce implementation

The output of the job is:
<Hello, 1>
<World, 2>
<Bye, 1>

MapReduce is designed as a batch processing framework, and because of this it is not suitable for ad-hoc data analysis. The time to process this kind of analysis is too large, and it does not allow programmers to perform iterative or one-shot analysis on Big Data [?]. The project Apache Hadoop provides an open source implementation of the MapReduce algorithm [?]. Today this implementation is used by many companies, for example Fujitsu Limited [?] or Amazon.com Inc. [?].
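The behaviour of Figures 15 and 16 can be imitated on a single machine with plain Python, which also makes the generic shuffle step explicit. This is only a sequential simulation of the programming model, not Hadoop or Google's implementation.

    from collections import defaultdict

    def map_fn(key, value):
        """Map: emit (word, 1) for every word of the input line."""
        return [(word, 1) for word in value.split()]

    def reduce_fn(key, values):
        """Reduce: sum up the counts for one word."""
        return key, sum(values)

    def run_mapreduce(lines):
        # Map phase.
        intermediate = []
        for line_no, line in enumerate(lines):
            intermediate.extend(map_fn(line_no, line))
        # Shuffle phase: group all intermediate values by their key.
        groups = defaultdict(list)
        for key, value in intermediate:
            groups[key].append(value)
        # Reduce phase.
        return dict(reduce_fn(key, values) for key, values in groups.items())

    print(run_mapreduce(["Hello World Bye World"]))
    # {'Hello': 1, 'World': 2, 'Bye': 1}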
V. CONCLUSION

This paper introduced the basic principles of Big Data analysis techniques and technology. As the examples have shown, Big Data is a very important topic in daily business: it helps companies to understand their customers and to improve business decisions, and further research in the medical and many other areas would not be possible without Big Data analysis. Due to these various requirements, many approaches are needed. This paper focused on association rule mining, cluster analysis and distributed computing. In the field of association rule mining, the Apriori and FP-Growth algorithms were presented. The Apriori algorithm is the most common one in this area, but it has to scan the data source quite often. In addition, its complexity is comparatively high because of the generation of the candidate item sets. In order to reduce the scans of the data source, the FP-Growth algorithm was published: instead of n + 1 scans it needs only two. After the special data structure, the FP-Tree, has been built, the further work can easily be parallelized and executed on multiple machines. The K-Means algorithm is the most common one for cluster analysis. Given a data source, it creates k clusters based on the Euclidean distance. The result depends on the initial cluster positions and the number of clusters; because the initial clusters are chosen at random, there is no unique solution for a specific problem. The K-Means++ algorithm improves the runtime by doing a preprocessing step before it proceeds as the K-Means algorithm. With this preprocessing, the K-Means++ algorithm is guaranteed to find a solution that is O(log(k)) competitive to the optimal K-Means solution. Finally, the MapReduce framework allows a parallel and distributed processing of Big Data, for which hundreds or even thousands of computers are used to process large amounts of data. The framework is easy to use, even without knowledge about parallel and distributed systems. All techniques presented in this paper are used for processing large data sets. Therefore it is usually not possible to get a response in seconds or even minutes; it usually takes multiple minutes, hours or days until a result is computed. For most companies it is necessary to get a very fast response, as in the OLAP approach. For this, Google's BigQuery is one possible solution, but with such tools for ad-hoc analysis there is no possibility for a deep analysis of the data set. To provide ad-hoc analysis together with a deep analysis of the data set, algorithms are needed that are more efficient and more specialized for a certain domain. Because of this, technology for analyzing Big Data is also an important area in the academic environment, where the focus is on the running time and the efficiency of these algorithms. Furthermore, the data volume is still rising, so another probable approach is a further parallelization of the computational tasks. On the other hand, the running time can also be improved by specialized algorithms like the Apriori algorithm for the analysis of related transactions.
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.
Big Application Execution on Cloud using Hadoop Distributed File System
Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------
Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014
Big Data Analytics An Introduction Oliver Fuchsberger University of Paderborn 2014 Table of Contents I. Introduction & Motivation What is Big Data Analytics? Why is it so important? II. Techniques & Solutions
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Using an In-Memory Data Grid for Near Real-Time Data Analysis
SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses
Hadoop Cluster Applications
Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Big Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
DATA CLUSTERING USING MAPREDUCE
DATA CLUSTERING USING MAPREDUCE by Makho Ngazimbi A project submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Boise State University March 2009
Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities
Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling
Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science
Data Intensive Computing CSE 486/586 Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING Masters in Computer Science University at Buffalo Website: http://www.acsu.buffalo.edu/~mjalimin/
Introduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
Binary Coded Web Access Pattern Tree in Education Domain
Binary Coded Web Access Pattern Tree in Education Domain C. Gomathi P.G. Department of Computer Science Kongu Arts and Science College Erode-638-107, Tamil Nadu, India E-mail: [email protected] M. Moorthi
Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,[email protected]
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,[email protected] Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop
ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: [email protected]
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
Study of Data Mining algorithm in cloud computing using MapReduce Framework
Study of Data Mining algorithm in cloud computing using MapReduce Framework Viki Patil, M.Tech, Department of Computer Engineering and Information Technology, V.J.T.I., Mumbai Prof. V. B. Nikam, Professor,
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Data Mining Applications in Manufacturing
Data Mining Applications in Manufacturing Dr Jenny Harding Senior Lecturer Wolfson School of Mechanical & Manufacturing Engineering, Loughborough University Identification of Knowledge - Context Intelligent
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm
Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm Purpose: key concepts in mining frequent itemsets understand the Apriori algorithm run Apriori in Weka GUI and in programatic way 1 Theoretical
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Frequent item set mining
Frequent item set mining Christian Borgelt Frequent item set mining is one of the best known and most popular data mining methods. Originally developed for market basket analysis, it is used nowadays for
Are You Ready for Big Data?
Are You Ready for Big Data? Jim Gallo National Director, Business Analytics February 11, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?
CS 378 Big Data Programming. Lecture 2 Map- Reduce
CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Big Data. Fast Forward. Putting data to productive use
Big Data Putting data to productive use Fast Forward What is big data, and why should you care? Get familiar with big data terminology, technologies, and techniques. Getting started with big data to realize
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
