Big Data Analytics beyond Map/Reduce


1 German-French Summer University for Young Researchers 2011: Cloud Computing, Challenges and Opportunities. Big Data Analytics beyond Map/Reduce. Prof. Dr. Volker Markl, TU Berlin

2 Shift Happens! Our Digital World! Video courtesy of Michael Brodie, Chief Scientist, Verizon. The original "Shift Happens" video by K. Fisch and S. McLeod focuses on the shift in society and is aimed at teacher education; Michael Brodie focuses on the shift in, and because of, the digital world.

3 Data Growth and Value. About data growth: $600 buys a disk drive that can store all of the world's music; 5 billion mobile phones were in use in 2010; 30 billion pieces of content are shared on Facebook every month; global data is projected to grow by 40% per year. About the value of captured data: 250 billion Euro of potential value to Europe's public sector administration; a potential 60% increase in retailers' operating margins through big data; 140,000 to 190,000 additional deep analytical talent positions needed. Source: Big Data: The Next Frontier for Innovation, Competition and Productivity (McKinsey)

4 Big Data. Data have swept into every industry and business function and are an important factor of production: companies store exabytes of data every year, and much of modern economic activity could not take place without them. Big Data creates value in several ways: it provides transparency, enables experimentation, brings about customization and tailored products, supports human decisions, and triggers new business models. The use of Big Data will become a key basis of competition and growth; companies failing to develop their analysis capabilities will fall behind. Source: Big Data: The Next Frontier for Innovation, Competition and Productivity (McKinsey)

5 Big Data Analytics. Data volume keeps growing: data warehouse sizes of about 1 PB are not uncommon, some businesses produce more than 1 TB of new data per day, and scientific scenarios are even larger (e.g., the LHC experiment produces ~15 PB per year). Some systems must support extreme throughput in transaction processing, especially at financial institutions. Analysis queries become more and more complex: discovering statistical patterns is compute-intensive and may require multiple passes over the data. The performance of single computing cores or single machines is not increasing fast enough to cope with this development.

6 Trends and Challenges. Trends: massive parallelization, virtualization, service-based computing, web-scale data management (analytics/BI as well as operational), multi-tenancy. Challenges (cf. the Claremont Report): re-architecting DBMSs, parallelization, continuous optimization, tight integration, service-based everything, the programming model, combining structured and unstructured data, media convergence.

7 Overview: Introduction; Big Data Analytics; Map/Reduce/Merge; Introducing the Cloud; Stratosphere (PACT and Nephele); Demo (Thomas Bodner, Matthias Ringwald); Mahout and Scalable Data Mining (Sebastian Schelter)

8 Map/Reduce Revisited BIG DATA ANALYTICS

9 Data Partitioning (I). Partitioning the data means creating a set of disjoint subsets. Example: sales data, where every year gets its own partition. For shared-nothing, the data must be partitioned across the nodes; if it were replicated, the system would effectively become a shared-disk one, with the local disks acting like a cache (which must be kept coherent). Partitioning with certain characteristics has further advantages: some queries can be limited to operate on certain sets only, if it is provable that all relevant data (passing the predicates) is in that partition, and partitions can simply be dropped as a whole (the data is rolled out) when they are no longer needed (e.g., discarding old sales).

10 Data Partitioning (II). How to partition the data into disjoint sets? Round robin: each set gets a tuple in turn; all sets are guaranteed an equal number of tuples, but there is no apparent relationship between the tuples in one set. Hash partitioned: define a set of partitioning columns and generate a hash value over those columns to decide the target set; all tuples with equal values in the partitioning columns are in the same set. Range partitioned: define a set of partitioning columns and split the domain of those columns into ranges; the range determines the target set, and all tuples in one set fall into the same range. A sketch of the three schemes follows.
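A minimal Java sketch (an illustration, not from the slides) of the three schemes for tuples keyed by an integer column, assuming n target partitions:

    import java.util.concurrent.atomic.AtomicLong;

    public class Partitioners {
        private final int n;                        // number of partitions/nodes
        private final AtomicLong counter = new AtomicLong();

        Partitioners(int n) { this.n = n; }

        // Round robin: ignores the tuple, cycles through the partitions.
        int roundRobin() {
            return (int) (counter.getAndIncrement() % n);
        }

        // Hash partitioning: equal key values always land in the same partition.
        int hashPartition(int key) {
            return Math.floorMod(Integer.hashCode(key), n);
        }

        // Range partitioning: boundaries {10, 20} map keys below 10 to
        // partition 0, below 20 to partition 1, and the rest to partition 2.
        int rangePartition(int key, int[] boundaries) {
            for (int i = 0; i < boundaries.length; i++)
                if (key < boundaries[i]) return i;
            return boundaries.length;
        }
    }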

11 Map/Reduce Revisited. The data model: key/value pairs, e.g. (int, string). Functional programming model with second-order functions: map takes an input key-value pair and emits a list of output key-value pairs, map: (k1, v1) -> list(k2, v2); reduce takes a key and a list of values and emits a single value per key, reduce: (k2, list(v2)) -> (k2, v3). The framework accepts a list of input pairs and outputs the result pairs.

12 Data Flow in Map/Reduce. The framework distributes the input pairs (Km, vm)* over parallel MAP(Km, vm) calls, each emitting intermediate pairs (Kr, vr)*; the framework groups these by key into (Kr, vr*) lists, which parallel REDUCE(Kr, vr*) calls turn into (Kr, vr) pairs; the framework collects the overall result (Kr, vr)*.

13 Map Reduce Illustrated (1). Problem: counting words in a parallel fashion, i.e. how many times different words appear in a set of files. juliet.txt: "Romeo, Romeo, wherefore art thou Romeo?" benvolio.txt: "What, art thou hurt?" Expected output: Romeo (3), art (2), thou (2), hurt (1), wherefore (1), what (1). Solution: a Map-Reduce job:

    map(filename, line) {
      foreach (word in line)
        emit(word, 1);
    }

    reduce(word, numbers) {
      int sum = 0;
      foreach (value in numbers) {
        sum += value;
      }
      emit(word, sum);
    }

14 Map Reduce Illustrated (2): diagram of the word-count example.

15 Data Analytics: Relational Algebra. Base operators: selection (σ), projection (π), set/bag union (∪), set/bag difference (\ or -), Cartesian product (×). Derived operators: join (⋈), set/bag intersection (∩), division (÷). Further operators: de-duplication, generalized projection (grouping and aggregation), outer-joins and semi-joins, sort.

16 Relational Operators as Map/Reduce Jobs: Selection / Projection / Aggregation. SQL query: SELECT year, SUM(price) FROM sales WHERE area_code = 'US' GROUP BY year. Map/Reduce job:

    map(key, tuple) {
      int year = YEAR(tuple.date);
      if (tuple.area_code == 'US')
        emit(year, { year => year, price => tuple.price });
    }

    reduce(key, tuples) {
      double sum_price = 0;
      foreach (tuple in tuples) {
        sum_price += tuple.price;
      }
      emit(key, sum_price);
    }

17 Relational Operators as Map/Reduce Jobs: Sorting. SQL query: SELECT * FROM sales ORDER BY year. Map/Reduce job (the mapper range-partitions the tuples by decade, each reducer sorts its partition):

    map(key, tuple) {
      emit(YEAR(tuple.date) DIV 10, tuple);
    }

    reduce(key, tuples) {
      emit(key, sort(tuples));
    }

18 Relational Operators as Map/Reduce Jobs: UNION. SQL query: SELECT phone_number FROM employees UNION SELECT phone_number FROM bosses. The Map/Reduce job needs two different mappers; the reducer emits each number once, which de-duplicates as UNION requires:

    map(key, employees_phonebook_entry) {
      emit(employees_phonebook_entry.number, ``);
    }

    map(key, bosses_phonebook_entry) {
      emit(bosses_phonebook_entry.number, ``);
    }

    reduce(phone_number, tuples) {
      emit(phone_number, ``);
    }

19 Relational Operators as Map/Reduce Jobs: INTERSECT. SQL query: SELECT first_name FROM employees INTERSECT SELECT first_name FROM bosses. The Map/Reduce job needs two different mappers that tag each name with its source relation; the same marker trick extends to set difference, as sketched below:

    map(key, employee_listing_entry) {
      emit(employee_listing_entry.first_name, `E`);
    }

    map(key, boss_listing_entry) {
      emit(boss_listing_entry.first_name, `B`);
    }

    reduce(first_name, markers) {
      if (`E` in markers and `B` in markers) {
        emit(first_name, ``);
      }
    }
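A self-contained Java sketch (not from the slides) of how the marker trick implements set difference (SQL EXCEPT), with plain collections standing in for the two mappers and the reducer:

    import java.util.*;

    public class ExceptJob {
        public static void main(String[] args) {
            List<String> employees = List.of("anna", "bob", "carol");
            List<String> bosses = List.of("bob");

            // "map" phase: emit (name, marker) pairs for both inputs
            Map<String, Set<String>> groups = new HashMap<>();
            employees.forEach(n -> groups.computeIfAbsent(n, k -> new HashSet<>()).add("E"));
            bosses.forEach(n -> groups.computeIfAbsent(n, k -> new HashSet<>()).add("B"));

            // "reduce" phase: keep names marked "E" but never "B"
            groups.forEach((name, markers) -> {
                if (markers.contains("E") && !markers.contains("B"))
                    System.out.println(name);       // prints anna and carol
            });
        }
    }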

20 The Petabyte Sort Benchmark. A benchmark to test the performance of distributed systems. Goal: sort one petabyte of 100-byte records. Implementation in Hadoop: a range partitioner splits the data into equal ranges (one for each participating node); the sort is basically the "range-partitioning sort" described earlier.

21 Petabyte Sorting Benchmark: Hardware. Per node: 2 quad-core 2.5 GHz CPUs, 4 SATA disks, 8 GB RAM (upgraded to 16 GB before the petabyte sort), 1 Gigabit Ethernet. Per rack: 40 nodes, 8 Gigabit Ethernet uplinks.

22 Cluster Utilization during Sort (chart).

23 Map/Reduce Revisited JOINS IN MAP/REDUCE

24 Symmetric Fragment-and-Replicate Join (II): diagram of the fragments and replicas across the nodes in the cluster.

25 Asymmetric Fragment-and-Replicate Join. We can do better if relation S is much smaller than R. Idea: reuse the existing partitioning of R and replicate the whole relation S to each node. Cost: p * B(S) transport + ??? local join. The asymmetric fragment-and-replicate join is a special case of the symmetric algorithm with m = p and n = 1; it is also called a broadcast join.

26 Broadcast Join. Equi-join L(A,X) ⋈ R(X,C), assumption: L << R. Idea: broadcast L completely to each node before the map phase begins, either via utilities like Hadoop's distributed cache or by having the mappers read L from the cluster filesystem at startup. The mapper then runs only over R: step 1, read the assigned input split of R into a hash table (build phase); step 2, scan the local copy of L and find the matching R tuples (probe phase); step 3, emit each such pair. Alternatively, read L into the hash table, then read R and probe (as in the sketch below). There is no need for partition/sort/reduce processing; the mapper outputs the final join result.
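A minimal in-memory Java sketch of the build-on-L variant (an illustration, not the Hadoop implementation): build a hash table over the small, fully replicated relation L, then stream the mapper's local split of R and probe:

    import java.util.*;

    public class BroadcastJoin {
        record LTuple(String a, int x) {}
        record RTuple(int x, String c) {}

        public static void main(String[] args) {
            List<LTuple> l = List.of(new LTuple("a1", 1), new LTuple("a2", 2));
            List<RTuple> rSplit = List.of(new RTuple(1, "c1"), new RTuple(3, "c3"));

            // build phase over the broadcast relation L
            Map<Integer, List<LTuple>> hash = new HashMap<>();
            for (LTuple t : l)
                hash.computeIfAbsent(t.x(), k -> new ArrayList<>()).add(t);

            // probe phase over the local split of R; emit each join pair
            for (RTuple r : rSplit)
                for (LTuple match : hash.getOrDefault(r.x(), List.of()))
                    System.out.println(match.a() + " | " + r.x() + " | " + r.c());
        }
    }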

27 Repartition Join. Equi-join L(A,X) ⋈ R(X,C), assumption: L < R. Map: identical processing logic for L and R; each tuple is read and emitted once; the intermediate key is a pair of the value of the actual join key X and an annotation identifying the relation the tuple belongs to (L or R). Partition and sort: partition by the hash value of the join key, e.g. h(key) % n; the reducer input is then ordered first on the join key and then on the relation name, i.e. a sequence of L(i), R(i) tuple blocks for ascending join keys i. Reduce: collect all L-tuples of the current L(i) block in a hash map and combine them with each R-tuple of the corresponding R(i) block; a sketch of this reducer logic follows.
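A hedged Java sketch of the reduce-side logic (illustrative, not the Hadoop API): tuples arrive tagged with their relation, grouped by join key and ordered so that the L block precedes the R tuples; the reducer buffers the L block and combines it with every R tuple of the same key:

    import java.util.*;

    public class RepartitionJoinReducer {
        record Tagged(String relation, String payload) {}

        // one reduce() invocation per join key; values are ordered L before R
        static void reduce(int key, List<Tagged> values) {
            List<String> lBlock = new ArrayList<>();
            for (Tagged t : values)
                if (t.relation().equals("L")) lBlock.add(t.payload());
                else                                    // an R tuple: join it
                    for (String l : lBlock)
                        System.out.println(key + " | " + l + " | " + t.payload());
        }

        public static void main(String[] args) {
            reduce(1, List.of(new Tagged("L", "a1"),
                              new Tagged("R", "c1"), new Tagged("R", "c2")));
        }
    }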

28 Multi-Dimensional Partitioned Join. Equi-join D1(A,X) ⋈ D2(B,Y) ⋈ F(C,X,Y): a star schema with fact table F and dimensions Di. Fragment: D1 and D2 are partitioned independently; the partitions for F are defined as D1 x D2. Replicate: for an F-tuple f, the partition is uniquely defined as (hash(f.x), hash(f.y)); for a D1-tuple d1 there is one degree of freedom (d1.y is undefined), so D1-tuples are replicated for each possible y value, and symmetrically for D2. Reduce: find and emit (f, d1, d2) triples; depending on the input sorting, different join strategies are possible.

29 Joins in Hadoop: charts of execution time over the number of nodes and over the selectivity, comparing the asymmetric and the multi-dimensional partitioned join.

30 Parallel DBMS vs. Map/Reduce. Schema support and indexing: present in a parallel DBMS, absent in Map/Reduce. Programming model: stating what you want (declarative: SQL) versus presenting an algorithm (procedural: C/C++, Java, ...). Optimization: limited in Map/Reduce. Scaling: limited in a parallel DBMS, good in Map/Reduce; the same holds for fault tolerance. Execution: a parallel DBMS pipelines results between operators, Map/Reduce materializes results between phases.

31 Simplified Relational Data Processing on Large Clusters MAP-REDUCE-MERGE

32 Map-Reduce-Merge: Motivation. Map/Reduce does not directly support processing multiple related, heterogeneous datasets, which causes difficulties and/or inefficiency when one must implement relational operators like joins. Map-Reduce-Merge adds a merge phase whose goal is to efficiently merge data that is already partitioned and sorted (or hashed). Map-Reduce-Merge workflows are comparable to RDBMS execution plans and can more easily implement parallel join algorithms. Signatures: map: (k1, v1) -> [(k2, v2)]; reduce: (k2, [v2]) -> (k2, [v3]); merge: ((k2, [v3]), (k3, [v4])) -> [(k4, v5)].

33 Introducing THE CLOUD

34 In the Cloud

35 "The interesting thing about cloud computing is that we've redefined cloud computing to include everything that we already do. I can't think of anything that isn't cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women's fashion. Maybe I'm an idiot, but I have no idea what anyone is talking about. What is it? It's complete gibberish. It's insane. When is this idiocy going to stop?" "We'll make cloud computing announcements. I'm not going to fight this thing. But I don't understand what we would do differently in the light of cloud." (Larry Ellison)

36 Steve Ballmer's Vision of Cloud Computing

37 What does Hadoop have to do with Cloud? "A few months back, Hamid Pirahesh and I were doing a roundtable with a customer of ours on cloud and data. We got into a set of standard issues -- data security being the primary one -- but when the dialog turned to Hadoop, a person raised his hand and asked: 'What has Hadoop got to do with cloud?' I responded, somewhat quickly perhaps, 'Nothing specific, and I am willing to have a dialog with you on Hadoop in and out of the cloud context', but it got me thinking. Is there a relationship, or not?"

38 Re-inventing the wheel - or not?

39 Parallel Analytics in the Cloud beyond Map/Reduce STRATOSPHERE

40 The Stratosphere Project*. Goal: explore the power of Cloud computing for complex information management applications with a database-inspired approach, and research and prototype a web-scale data analytics infrastructure to analyze, aggregate, and query textual and (semi-)structured data. The StratoSphere ("Above the Clouds") query processor runs on Infrastructure as a Service. Use cases: scientific data, life sciences, linked data. (* FOR 1306: a DFG-funded collaborative project among TU Berlin, HU Berlin and HPI Potsdam.)

41 Example: Climate Data Analysis. Climate model output covering regions of 1100 km and 950 km extent at 2 km resolution, about 10 TB, with up to 200 parameters per data point, e.g.: PS (Pa, surface pressure), T_2M (K, air temperature), TMAX_2M / TMIN_2M (K, 2 m maximum/minimum temperature), U and V (m/s, wind components), QV_2M (kg/kg, 2 m specific humidity), CLCT (total cloud cover). Analysis tasks on climate data sets: validate climate models, locate hot-spots in climate models (monsoon, drought, flooding), and compare climate models based on different parameter settings. Necessary data processing operations: filter, aggregation (sliding window), join, multi-dimensional sliding-window operations, geospatial/temporal joins, handling of uncertainty.

42 Further Use-Cases: text mining in the biosciences, cleansing of linked open data.

43 Outline: Architecture of the Stratosphere System; the PACT Programming Model; the Nephele Execution Engine; Parallelizing PACT Programs.

44 Architecture Overview. Higher-level languages: JAQL, Pig, Hive on the Hadoop stack; Scope and DryadLINQ on the Dryad stack; for the Stratosphere stack this layer (JAQL? Pig? Hive?) is still open. Parallel programming models: the Map/Reduce programming model (Hadoop), and the PACT programming model (Stratosphere). Execution engines: Hadoop, Dryad, and Nephele.

45 Data-Centric Parallel Programming. Map/Reduce is schema-free, but many semantics are hidden inside the user code (tricks are required to push operations such as selections, projections, and grouping into map/reduce), and there is a single default way of parallelization. Relational databases are schema-bound (relational model) but have well-defined properties and requirements for parallelization, which makes them flexible and optimizable. GOAL: advance the map/reduce programming model.

46 Stratosphere in a Nutshell. PACT programming model: Parallelization Contracts (PACTs), a declarative definition of data parallelism centered around second-order functions, generalizing map/reduce. Nephele: a Dryad-style execution engine that evaluates dataflow graphs in parallel, reads data from a distributed filesystem, and is a flexible engine for complex jobs. Stratosphere = Nephele + PACT: the PACT compiler translates PACT programs into Nephele dataflow graphs, combining the parallelization abstraction with flexible execution; the choice among execution strategies gives optimization potential.

47 Overview: Parallelization Contracts (PACTs); the Nephele Execution Engine; Compiling/Optimizing Programs; Related Work.

48 Intuition for Parallelization Contracts. Map and reduce are second-order functions: they call first-order functions (the user code) and provide them with subsets of the input data, thereby defining dependencies between the records that must be obeyed when splitting the key/value pairs into independent subsets (the required partition properties). Map: all records are independently processable. Reduce: records with an identical key must be processed together.

49 Contracts beyond Map and Reduce. Cross: two inputs; each combination of records from the two inputs is built and is independently processable. Match: two inputs; each combination of records with an equal key from the two inputs is built, and each pair is independently processable. CoGroup: multiple inputs; pairs with an identical key are grouped for each input, and the groups of all inputs with an identical key are processed together.

50 Parallelization Contracts (PACTs). A PACT is a second-order function that defines properties on the input and output data of its associated first-order function: data flows through an input contract, the first-order function (user code), and an output contract. The input contract specifies dependencies between records (a.k.a. "what must be processed together?") and generalizes map/reduce; logically, it abstracts a (set of) communication pattern(s): for "reduce", repartition-by-key; for "match", broadcast-one or repartition-by-key. The output contract captures generic properties preserved or produced by the user code (key property, sort order, partitioning, etc.) that are relevant to the parallelization of succeeding functions.

51 Optimizing PACT Programs. For certain PACTs, several distribution patterns exist that fulfill the contract, and the choice of the best one is up to the system. Created properties (like a partitioning) may be reused for later operators, so the system needs a way to find out whether they still hold after the user code has run; output contracts are a simple way to specify that. Example output contracts: Same-Key, Super-Key, Unique-Key. Using these properties, optimization across multiple PACTs is possible, e.g. with a simple System-R style optimizer approach.

52 From PACT Programs to Data Flows. PACT code (grouping):

    function match(Key k, Tuple val1, Tuple val2) -> (Key, Tuple) {
      Tuple res = val1.concat(val2);
      res.project(...);
      Key k = res.getColumn(1);
      return (k, res);
    }

Nephele code (communication) wrapping the user function UF:

    invoke():
      while (!input2.eof)
        KVPair p = input2.next();
        hash-table.put(p.key, p.value);
      while (!input1.eof)
        KVPair p = input1.next();
        KVPair t = hash-table.get(p.key);
        if (t != null)
          KVPair[] result = UF.match(p.key, p.value, t.value);
          output.write(result);
      end

A PACT program (UF1 and UF2 as map, UF3 as match, UF4 as reduce) is compiled into a Nephele DAG (vertices V1-V4 connected by in-memory and network channels) and then spanned into a parallel data flow.

53 NEPHELE EXECUTION ENGINE

54 Nephele Execution Engine. Executes Nephele schedules compiled from PACT programs. Design goals: exploit the scalability/flexibility of clouds, provide predictable performance, execute efficiently on many cores, and offer flexible fault-tolerance mechanisms. Nephele is inherently designed to run on top of an IaaS cloud: it supports heterogeneity through different types of VMs, knows the cloud's pricing model, performs VM allocation and de-allocation, and infers the network topology.

55 Nephele Architecture. Standard master/worker pattern: a client submits jobs over the public network (Internet) to the master inside the compute cloud, which coordinates the workers over a private/virtualized network, backed by a cloud controller and persistent storage. Workers can be allocated on demand as the workload changes over time.

56 Structure of a Nephele Schedule. A Nephele schedule is represented as a DAG: vertices represent tasks, edges denote communication channels; for example, Input 1 (LineReaderTask.program) -> Task 1 (MyTask.program) -> Output 1 (LineWriterTask.program). Mandatory information for each vertex: the task program and, for I/O vertices only, the input/output data location. Optional information for each vertex: the number of subtasks (degree of parallelism), the number of subtasks per virtual machine, the type of virtual machine (number of CPU cores, RAM, ...), the channel types, and the sharing of virtual machines among tasks.

57 Internal Schedule Representation. The Nephele schedule is converted into an internal representation with explicit parallelization: the parallelization range (mpl) and the wiring of the subtasks are derived from the PACT. Each vertex carries an explicit assignment to virtual machines, specified by ID and type (e.g., ID: 1, type m1.small; ID: 2, type m1.large), where the type refers to a hardware profile.

58 Execution Stages. Issues with on-demand allocation: when to allocate virtual machines, when to deallocate them, and the fact that there is no guarantee of resource availability. Stages ensure three properties: the VMs of the upcoming stage are available, all workers are set up and ready, and the data of previous stages is stored in a persistent manner.

59 Channel Types. Network channels (pipelined): vertices must be in the same stage. In-memory channels (pipelined): vertices must run on the same VM and be in the same stage. File channels: vertices must run on the same VM and be in different stages.

60 Some Evaluation (1/2). Demonstrates the benefits of dynamic resource allocation. Challenge: sort and aggregate; sort 100 GB of integer numbers (from the GraySort benchmark) and aggregate the top 20% of these numbers (exact result!). First execution as map/reduce jobs with Hadoop: three map/reduce jobs on 6 VMs (each with 8 CPU cores and 24 GB RAM), using the TeraSort code for sorting and custom code for the aggregation. Second execution as map/reduce jobs with Nephele: a map/reduce compatibility layer allows Hadoop M/R programs to run while Nephele controls the resource allocation; the idea is to adapt the allocated resources to the required processing power.

61 Some Evaluation (2/2): charts of the average instance utilization (USR/SYS/WAIT) and the average network traffic among instances over time, for phases (a) through (h). The M/R jobs on Hadoop show poor resource utilization, whereas Nephele automatically deallocates VMs that are no longer needed.

62 References. [WK09] D. Warneke, O. Kao: Nephele: Efficient Parallel Data Processing in the Cloud. SC-MTAGS 2009. [BEH+10] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. SoCC 2010. [ABE+10] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2), 2010. [AEH+11] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, et al.: MapReduce and PACT - Comparing Data Parallel Programming Models. BTW 2011.

63 Ongoing Work: Adaptive Fault-Tolerance (Odej Kao); Robust Query Optimization (Volker Markl); Parallelization of the PACT Programming Model (Volker Markl); Continuous Re-Optimization (Johann-Christoph Freytag); Validating Climate Simulations with Stratosphere (Volker Markl); Text Analysis with Stratosphere (Ulf Leser); Data Cleansing with Stratosphere (Felix Naumann); JAQL on Stratosphere (student project at TUB); Open Source Release: Nephele + PACT (TUB, HPI, HU).

64 Overview: Introduction; Big Data Analytics; Map/Reduce/Merge; Introducing the Cloud; Stratosphere (PACT and Nephele); Demo (Thomas Bodner, Matthias Ringwald); Mahout and Scalable Data Mining (Sebastian Schelter)

65 The Information Revolution

66 Demo Screenshots WEBLOG ANALYSIS QUERY

67 Weblog Query and Plan.

    SELECT r.url, r.rank, r.avg_duration
    FROM Documents d JOIN Rankings r ON r.url = d.url
    WHERE CONTAINS(d.text, [keywords])
      AND r.rank > [rank]
      AND NOT EXISTS (
        SELECT * FROM Visits v
        WHERE v.url = d.url AND v.date < [date]);

68 Weblog Query Job Preview

69 Weblog Query Optimized Plan

70 Weblog Query Nephele Schedule in Execution

71 Demo Screenshots ENUMERATING TRIANGLES FOR SOCIAL NETWORK MINING

72 Enumerating Triangles Graph and Job

73 Enumerating Triangles Job Preview

74 Enumerating Triangles Optimized Plan

75 Enumerating Triangles Nephele Schedule in Execution

76 Scalable Data Mining APACHE MAHOUT (Sebastian Schelter)

77 Apache Mahout: Overview. What is Apache Mahout? An Apache Software Foundation project aiming to create scalable machine learning libraries under the Apache License, with a focus on scalability; it is not a competitor to R or Weka. In use at Adobe, Amazon, AOL, Foursquare, Mendeley, Twitter, Yahoo. Scalability here means that the processing time t is proportional to the problem size P divided by the resource size R, i.e. t ∝ P / R; this does not imply Hadoop or parallelism, although the majority of the implementations use Map/Reduce.

78 Apache Mahout: Clustering. Unsupervised learning: assign a set of data points to subsets (called clusters) so that points in the same cluster are similar in some sense. Algorithms: K-Means, Fuzzy K-Means, Canopy, Mean Shift, Dirichlet Process, Spectral Clustering.

79 Apache Mahout: Classification. Supervised learning: learn a decision function that predicts labels y on data points x, given a set of training samples {(x, y)}. Algorithms: Logistic Regression (sequential but fast), Naive Bayes / Complementary Naive Bayes, Random Forests.

80 Apache Mahout: Collaborative Filtering. An approach to recommendation mining: given a user's preferences for items, guess which other items would be highly preferred. Algorithms: neighborhood methods (item-based collaborative filtering) and latent factor models (matrix factorization using Alternating Least Squares); an illustrative sketch of the item-based idea follows.
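A small, self-contained Java sketch (an illustration, not Mahout's API) of the item-based idea: an unseen item is scored by the similarity-weighted average of the user's known ratings; the item names and similarity values here are made up for the example:

    import java.util.*;

    public class ItemBasedCF {
        public static void main(String[] args) {
            // the user's known ratings: itemId -> rating
            Map<String, Double> ratings = Map.of("item1", 5.0, "item2", 2.0);
            // precomputed item-item similarities to the candidate item
            Map<String, Double> simToCandidate = Map.of("item1", 0.9, "item2", 0.1);

            double num = 0, den = 0;
            for (var e : ratings.entrySet()) {
                double s = simToCandidate.getOrDefault(e.getKey(), 0.0);
                num += s * e.getValue();
                den += Math.abs(s);
            }
            // (0.9 * 5.0 + 0.1 * 2.0) / 1.0 = 4.7
            System.out.println("predicted rating: " + (den == 0 ? 0 : num / den));
        }
    }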

81 Apache Mahout: Singular Value Decomposition. A matrix decomposition technique used to create an optimal low-rank approximation of a matrix; used for dimensionality reduction, unsupervised feature selection, and Latent Semantic Indexing. Algorithms: Lanczos algorithm, stochastic SVD.

82 Comparing implementations of data mining algorithms in Hadoop/Mahout and Nephele/PACT SCALABLE DATA MINING

83 Problem Description: Pairwise Row Similarity Computation. Computes the pairwise similarities of the rows (or columns) of a sparse matrix using a predefined similarity function; used for computing document similarities in large corpora and for precomputing item-item similarities for recommendations (collaborative filtering). The similarity function can be cosine, Pearson correlation, log-likelihood ratio, Jaccard coefficient, ...

84 Map/Reduce Implementation. Step 1: compute the similarity-specific row weights and transpose the matrix, thereby creating an inverted index. Step 2: map out all pairs of co-occurring values, then collect all co-occurring values per row pair and compute the similarity value. Step 3: use secondary sort to keep only the k most similar rows. The PACT implementation follows the same structure.

85 Comparison. Equivalent implementations exist in Mahout and PACT; the problem maps relatively well to the Map/Reduce paradigm. Insight: standard Map/Reduce code can be ported to Nephele/PACT with very little effort; output contracts and in-memory forwards offer hooks for performance improvements (unfortunately not applicable in this particular use case).

86 Problem Description: K-Means. A simple iterative clustering algorithm that uses a predefined number of clusters (k): start with a random selection of cluster centers, assign each point to the nearest cluster, recompute the cluster centers, and iterate until convergence.

87 Mahout Implementation. Initialization: generate k random cluster centers from the data points (optional) and put the centers into the distributed cache. Map: find the nearest cluster for each data point and emit (cluster id, data point). Combine: partially aggregate the distances per cluster. Reduce: compute the new centroid for each cluster. Repeat: output the converged cluster centers, or the centers after n iterations, and optionally the clustered data points. A plain-Java sketch of one such iteration follows.
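This sketch (an assumption for illustration, not Mahout's code) mirrors one K-Means iteration in plain Java, collapsing the map, combine, and reduce steps into loops over an in-memory array of one-dimensional points:

    import java.util.*;

    public class KMeansIteration {
        public static void main(String[] args) {
            double[] points = {1.0, 1.5, 9.0, 10.0};
            double[] centers = {0.0, 8.0};              // k = 2 initial centers

            // "map": assign each point to its nearest center;
            // "combine"/"reduce": aggregate the sum and count per cluster
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centers.length; c++)
                    if (Math.abs(p - centers[c]) < Math.abs(p - centers[best]))
                        best = c;
                sum[best] += p;
                count[best]++;
            }

            // recompute the centroids; a real job iterates until convergence
            for (int c = 0; c < centers.length; c++)
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            System.out.println(Arrays.toString(centers)); // [1.25, 9.5]
        }
    }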

88 Stratosphere Implementation: visualized execution plan of the K-Means job.

89 Code Analysis. Comparison of the implementations: the actual execution plans in the underlying distributed systems are nearly equivalent, but the Stratosphere implementation is more intuitive and closer to the mathematical formulation of the algorithm.

90 Problem Description: Naive Bayes. A simple classification algorithm based on Bayes' theorem. General Naive Bayes assumes feature independence and often yields good results even when this assumption does not hold. Mahout's version of Naive Bayes is a specialized approach for document classification based on the tf-idf weight metric.

91 M/R Overview. Classification: a straightforward approach that simply reads the complete model into memory; the classification is done in the mapper, and the reducer only sums up statistics for the confusion matrix. Trainer: much higher complexity, since it needs to count documents, features, features per document, and features per corpus. Mahout's implementation is optimized by exploiting Hadoop-specific features like secondary sort and by reading results into memory from the cluster filesystem.

92 M/R Trainer Overview: dataflow diagram. The training data passes through a feature extractor into term-document, word-frequency, and document counters (termdocc, wordfreq, docc), a tf-idf calculation (tfidf), a weight summer computing the per-k and per-j sums, feature and vocabulary counters (featurec, vocabc), and finally a theta normalizer (thetanorm).

93 PACT Trainer Overview. The PACT implementation looks even more complex, but PACTs can be combined in a much more fine-grained manner: since PACT offers local in-memory forwards, more and higher-level functions such as Cross and Match can be used, and fewer framework-specific tweaks are necessary for a performant implementation. The visualized execution plan is much more similar to the algorithmic formulation of computing several counts and combining them into a model at the end, and the subcalculations can be seen and unit-tested in isolation.

94 PACT Trainer Overview: visualized execution plan.

95 Hot Path: diagram of the data volumes along the hot path of the plan, shrinking from 7.4 GB and 14.8 GB over 5.89 GB and 3.53 GB down to 84 kB, 8 kB, and 5 kB.

96 PACT Trainer: Future Work. The PACT implementation can still be tuned by sampling the input data, by using the more flexible memory management of Stratosphere, and by employing the context concept of PACTs for a simpler distribution of the computed parameters.

97 Thank You (shown on the slide in many languages, including Hindi, Russian, Chinese, Thai, Spanish, Portuguese, Arabic, Italian, German, French, Tamil, Japanese, and Korean)

98 Programming in a more abstract way PARALLEL DATA FLOW LANGUAGES

99 Introduction. The MapReduce paradigm is too low-level: it offers only two declarative primitives (map + reduce), is extremely rigid (one input, a two-stage data flow), and requires custom code even for simple operations such as projection and filtering; such code is difficult to reuse and maintain and impedes optimization. Dataflow programming languages combine high-level declarative querying with low-level programming in MapReduce: Hive, JAQL, Pig.

100 Hive. A data warehouse infrastructure built on top of Hadoop, providing data summarization and ad-hoc querying through a simple query language, Hive QL (based on SQL), which is extendable via custom mappers and reducers. Hive is a subproject of Hadoop and does not prescribe its own storage format.

101 Hive Example.

    LOAD DATA INPATH '/data/visits' INTO TABLE visits;
    INSERT OVERWRITE TABLE visitcounts
      SELECT url, category, count(*) FROM visits GROUP BY url, category;

    LOAD DATA INPATH '/data/urlinfo' INTO TABLE urlinfo;
    INSERT OVERWRITE TABLE visitcounts
      SELECT vc.*, ui.* FROM visitcounts vc JOIN urlinfo ui ON (vc.url = ui.url);

    INSERT OVERWRITE TABLE gcategories
      SELECT category, count(*) FROM visitcounts GROUP BY category;

    INSERT OVERWRITE TABLE topurls
      SELECT TRANSFORM (visitcounts) USING 'top10';

102 JAQL. A higher-level query language for JSON documents, developed at IBM's Almaden research center. It supports several operations known from SQL (grouping, joining, sorting) and has built-in support for loops, conditionals, and recursion; custom Java methods extend JAQL. JAQL scripts are compiled to MapReduce jobs. Various I/O options: local FS, HDFS, HBase, custom I/O adapters.

103 JAQL Example.

    registerfunction('top', 'de.tuberlin.cs.dima.jaqlextensions.top10');
    $visits = hdfsread('/data/visits');
    $visitcounts = $visits -> group by $url = $ into { $url, num: count($) };
    $urlinfo = hdfsread('/data/urlinfo');
    $visitcounts = join $visitcounts, $urlinfo
                   where $visitcounts.url == $urlinfo.url;
    $gcategories = $visitcounts -> group by $category = $ into { $category, num: count($) };
    $topurls = top10($gcategories);
    hdfswrite('/data/topurls', $topurls);

104 Pig. A platform for analyzing large data sets. Pig consists of two parts: Pig Latin, a data processing language, and the Pig infrastructure, an evaluator for Pig Latin programs. Pig compiles Pig Latin into physical plans that are executed over Hadoop, serving as an interface between the declarative style of SQL and the low-level, procedural style of MapReduce.

105 Pig Example.

    visits = load '/data/visits' as (user, url, time);
    visitcounts = foreach visits generate url, count(visits);
    urlinfo = load '/data/urlinfo' as (url, category, prank);
    visitcounts = join visitcounts by url, urlinfo by url;
    gcategories = group visitcounts by category;
    topurls = foreach gcategories generate top(visitcounts, 10);
    store topurls into '/data/topurls';

Example taken from the talk "Pig Latin: A Not-So-Foreign Language For Data Processing", SIGMOD 2008.

106 Literature. C. Olston et al. (2008): Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, ACM. Apache Pig. A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy: Hive: A Warehousing Solution Over a Map-Reduce Framework. Apache Hive. K. S. Beyer, V. Ercegovac, R. Krishnamurthy, S. Raghavan, J. Rao, F. Reiss, E. J. Shekita, D. E. Simmen, S. Tata, S. Vaithyanathan, H. Zhu: Towards a Scalable Enterprise Content Analytics Platform. IEEE Data Eng. Bull. 32, 2009. JAQL.

107 QUERY COPROCESSING ON GRAPHICS PROCESSORS

108 Query Coprocessing on GPUs. Graphics processors (GPUs) have recently emerged as powerful coprocessors for general-purpose computation, offering roughly 10x the computational power and 5x the memory bandwidth of the CPU. Parallel primitives for query processing are available that exploit GPU hardware features such as high thread parallelism and the reduction of memory stalls through the fast local memory, and that scale to hundreds of processors thanks to their lock-free design and the low synchronization cost achieved through local memory.

109 Query Coprocessing on GPUs. Map: given an array of data tuples and a function, a map applies the function to every tuple; it uses multiple thread groups to scan the relation, each thread group being responsible for a segment of the relation, and the access pattern of the threads in each group is designed to exploit the coalesced memory access feature of the GPU. Scatter and gather: scatter performs indexed writes to a relation (e.g. for hashing), defined by a location array; gather performs indexed reads from a relation, also defined by a location array; both can be implemented with the multipass optimization scheme to improve their temporal locality.

110 Query Coprocessing on GPUs. Prefix scan: applies a binary operator to the input relation; an example is the prefix sum, an important operation in parallel databases (its semantics are sketched below). Reduce: computes a value based on the input relation; implemented as a multipass algorithm utilizing the local memory optimization, with a logarithmic number of passes constrained by the local memory size per multiprocessor.
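To make the scan primitive concrete, here is a sequential Java sketch of its semantics (an illustration, not GPU code; the parallel GPU version computes the same result in a logarithmic number of passes):

    public class PrefixScan {
        // exclusive prefix sum: out[i] is the sum of in[0..i-1]
        static int[] exclusiveScan(int[] in) {
            int[] out = new int[in.length];
            int acc = 0;
            for (int i = 0; i < in.length; i++) {
                out[i] = acc;
                acc += in[i];
            }
            return out;
        }

        public static void main(String[] args) {
            // {3, 1, 7} -> {0, 3, 4}
            System.out.println(java.util.Arrays.toString(
                exclusiveScan(new int[]{3, 1, 7})));
        }
    }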

111 An Architectural Hybrid of MapReduce and DBMS HADOOPDB

112 Parallel Data Processing Architectures. Two major architectures: 1. parallel databases, i.e. standard relational databases in a (usually) shared-nothing cluster; 2. MapReduce, i.e. data analysis via parallel map and reduce jobs in a replicated cluster. Both approaches have their pros and cons.

113 Parallel RDBMSs. Pros: usually very good and consistent performance; a flexible and proven interface (SQL). Cons: scaling is rather limited (tens of nodes); they do not work well in heterogeneous clusters; they are not very fault-tolerant.

114 MapReduce. Pros: very fault-tolerant, with automatic load balancing; operates well in heterogeneous clusters. Cons: writing map/reduce jobs is more complicated than writing SQL queries, and performance depends largely on the skill of the programmer.

115 HadoopDB. Both approaches have their strengths and weaknesses; the idea of HadoopDB is to combine them. Traditional relational databases serve as the data storage and data processing nodes, while MapReduce handles query parallelization, job tracking, etc.; an automatic SQL-to-MapReduce-to-SQL (SMS) query rewriter (based on Hive) translates the queries. Pushing as many operations as possible into the database layer improves data access performance, while MapReduce improves fault tolerance and offers solid cluster management.

116 HadoopDB Overview: architecture diagram. A user SQL query passes through the SMS planner (backed by a system catalog) and becomes a MapReduce job; the JobTracker on the master node drives TaskTrackers on the worker nodes, each of which pushes SQL into a local Postgres database holding replicated table data (nodes #1 through #n).

117 HadoopDB Sample Query.

    SELECT YEAR(saleDate), SUM(revenue)
    FROM sales
    GROUP BY YEAR(saleDate);

The SMS planner rewrites this query into a MapReduce job over the local databases.

118 Experimental Findings (I). Compared with native Hadoop (Hive), Vertica, and a commercial row-oriented DB. The experiments were performed on 10/50/100-node Amazon EC2 cloud instances. Benchmark used: A. Pavlo et al.: A Comparison of Approaches to Large-Scale Data Analysis, SIGMOD 2009.

119 Experimental Findings (II). In the absence of failures, HadoopDB is usually slower than a parallel DBMS. HadoopDB is consistently faster than Hadoop, but takes about 10 times longer to load the data. HadoopDB's performance decreases significantly less than Vertica's in case of node failures, and HadoopDB is not as susceptible to single slow nodes as Vertica.

120 Literature. A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, A. Silberschatz: HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB 2(1), 2009.

121 Basics of Parallel Processing. Parallel speedup and Amdahl's law. Levels of parallelism: instruction-level, data, task. Modes of query parallelism: inter-query / intra-query; pipeline (inter-operator) / data (intra-operator). Parallel database operations.

122, 123 Parallel Speedup: charts of the achieved speedup over the number of processors.
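The two speedup slides build on the standard definitions, which can be stated compactly (this formula is textbook material, not transcribed from the charts): with T(1) the sequential and T(p) the parallel running time, the speedup is S(p) = T(1) / T(p), and Amdahl's law bounds it for a parallelizable fraction f of the work:

    S(p) = 1 / ((1 - f) + f / p),   so that   S(p) -> 1 / (1 - f)   as p -> infinity.

For example, with f = 0.9 even infinitely many processors yield a speedup of at most 10.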

124 Levels of Parallelism on Hardware. Instruction-level parallelism: single instructions are automatically processed in parallel; example: modern CPUs with multiple pipelines and instruction units. Data parallelism: different data can be processed independently, and each processor executes the same operations on its share of the input data; examples: distributing loop iterations over multiple processors, or the vector units of CPUs. Task parallelism: tasks are distributed among the processors/nodes, and each processor executes a different thread/process; example: threaded programs.

125 Modes of Query Parallelism. Inter-query parallelism (multiple concurrent queries): necessary for efficient resource utilization (while one query waits, e.g. for I/O, another one executes); requires concurrency control (locking mechanisms) to guarantee the transactional properties (the "I" in ACID); important for highly transactional scenarios (OLTP). Intra-query parallelism (parallel processing of a single query): I/O parallelism, i.e. concurrent reading from multiple disks (hidden: hardware RAID; transparent: spanned tablespaces); intra-operator parallelism, where multiple threads work on the same operator (example: a parallel sort); inter-operator parallelism, where multiple pipelined parts of the plan run in parallel; important for complex analytical tasks (OLAP).

126 Pipeline Parallelism: example plan (scans of T1, T2, T3, two hash joins, a sort, and a return operator). Step 1: two threads scan one base table each and build the hash tables for the joins. Step 2: one thread scans the remaining table and probes the hash tables, while a second thread starts the sort (sorting sub-lists, merging the first lists). Step 3: one thread returns the result, business as usual.

127 Pipeline Parallelism. Pipeline parallelism is also called inter-operator parallelism, because the parallelism is between the operators: multiple pipelines are executed simultaneously. It is limited in its applicability, applying only if multiple pipelines are present that are not totally dependent on each other. Problem: high synchronization overhead; it is mostly limited to a low degree of parallelism (not too many pipelines per query) and only suited for shared-memory architectures.

128 Data Parallelism. Since pipeline parallelism is not applicable to a large degree, we turn to data parallelism: the data is divided into several subsets, and most operations do not need a complete view of the data (a filter, for example, looks only at a single tuple at a time). Subsets can then be processed independently and hence in parallel, with a degree of parallelism as high as the number of possible subsets (for a filter: as high as the number of tuples). Some operations, however, need a view of larger portions of the data; a grouping/aggregation operation, for example, needs all tuples with the same grouping key. Are they all in the same set, and can we guarantee that? Note that different operators need different sets!

129 Basics of Parallel Query Processing. Levels of resource sharing: shared-memory, shared-disk, shared-nothing. Data partitioning: round-robin, hash, range. Parallel operators and costs: tuple-at-a-time operators (i.e. selection), sorting, projection, grouping, aggregation, join.

130 Parallel Architectures (I): Shared Memory. Several CPUs share a single memory and disk (array); communication happens over a single common bus. Source: Garcia-Molina et al., Database Systems: The Complete Book, Second Edition.

131 Parallel Architectures (II): Shared Disk. Several nodes with multiple CPUs, each node having its private memory; a single attached disk (array), often NAS, SAN, etc. Source: Garcia-Molina et al., Database Systems: The Complete Book, Second Edition.

132 Parallel Architectures (III): Shared Nothing. Each node has its own set of CPUs, memory, and attached disks; the data needs to be partitioned over the nodes and is exchanged through direct node-to-node communication. Source: Garcia-Molina et al., Database Systems: The Complete Book, Second Edition.

133 Data Partitioning (I). Partitioning the data means creating a set of disjoint subsets. Example: sales data, where every year gets its own partition. For shared-nothing, the data must be partitioned across the nodes; if it were replicated, the system would effectively become a shared-disk one, with the local disks acting like a cache (which must be kept coherent). Partitioning with certain characteristics has further advantages: some queries can be limited to operate on certain sets only, if it is provable that all relevant data (passing the predicates) is in that partition, and partitions can simply be dropped as a whole (the data is rolled out) when they are no longer needed (e.g., discarding old sales).

134 Data Partitioning (II). How to partition the data into disjoint sets? Round robin: each set gets a tuple in turn; all sets are guaranteed an equal number of tuples, but there is no apparent relationship between the tuples in one set. Hash partitioned: define a set of partitioning columns and generate a hash value over those columns to decide the target set; all tuples with equal values in the partitioning columns are in the same set. Range partitioned: define a set of partitioning columns and split the domain of those columns into ranges; the range determines the target set, and all tuples in one set fall into the same range.

135 Data Parallelism Example. A client sends a SQL query to one of the cluster nodes, which becomes the "coordinator". The coordinator compiles the query (parsing, checking, optimization, parallelization) and sends partial plans to the other cluster nodes, describing their tasks. The coordinator also executes the partial plan on its part of the data, then collects the partial results and finalizes them (see the next slide).

136 Data Parallelism Example. For shared-nothing and shared-disk, multiple instances of a sub-plan are executed on different computers, operating on different splits or partitions of the data. At certain points the results from the sub-plans are collected: in the example plan, the parallel instances scan, index-scan, and fetch their partitions of T1 and T2, perform an NL-join, sort, group, and pre-aggregate, and the final aggregation happens after the sub-plan results are collected at the point of data shipping. For more complex queries, results are not collected but re-distributed for further parallel processing.

137 Parallel Operators. Ideally, operate as much as possible on individual partitions of the data and bring the operation to the data: no communication is needed, giving ideal parallelism. This is easy for simple per-tuple operators (scan, IX-scan, fetch, filter). Problematic: some operators need the whole picture; sorts and aggregations, for example, can only be preprocessed in parallel and need a final step on a single node, unless they occur in a correlated subplan known to contain only tuples from one partition. Joins need matching tuples: either organize the inputs accordingly, or join on the coordinator after the collection of the partial results (which is not parallel any more!).

138 Notations and Assumptions. S: a relation. S[i, h]: partition i of relation S according to partitioning scheme h. B(S): the number of blocks of relation S. p: the number of nodes. We assume a shared-nothing architecture, as most commercial database vendors use shared-nothing approaches. Network transfer is at least as expensive as disk access (in some cost models it is still far more expensive); today network bandwidth is roughly on par with disk bandwidth, but the network is shared, and especially switches and routers have a throughput limit. Partitioning schemes (hash/range) are assumed to produce partitions of roughly equal size.

139 Parallel Selection. Selection can be parallelized very efficiently (an embarrassingly parallel problem): each node performs the selection on its existing local partition. Selection needs no context, so the data can be partitioned in an arbitrary way; the partial results are unioned afterwards. Cost: B(S)/p.

140 Parallel Projection, Grouping, Aggregation

141 Parallel Sorting. Range-partitioning sort (partition by range, then sort): range-partition the relation according to the sort columns, then sort the single partitions locally (e.g. by TPMMS). Cost: B(S) partitioning + B(S) transfer + B(S)/p local sorting. Problem: how to find a uniform range-partitioning scheme? The result is already partitioned across the cluster. Parallel external sort-merge (sort locally, then merge): reuse an existing data partitioning, sort the partitions locally (e.g. by TPMMS), then merge the sorted partitions, e.g. one node merges two partitions at a time until the whole relation is sorted. Cost: B(S)/p local sorting + log2(p)*B(S)/2 transfer + log2(p)*B(S) local merging. The result ends up sitting on one machine. A worked comparison of the two cost formulas follows.
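Plugging illustrative numbers into the two formulas (the numbers are an assumption, not from the slides), say B(S) = 8,000 blocks and p = 8 nodes: the range-partitioning sort costs 8,000 (partitioning) + 8,000 (transfer) + 8,000/8 = 1,000 (local sorting), i.e. 17,000 block operations, while the parallel external sort-merge costs 1,000 (local sorting) + log2(8) * 8,000/2 = 12,000 (transfer) + log2(8) * 8,000 = 24,000 (local merging), i.e. 37,000 block operations. This illustrates why the range-partitioning sort is usually preferred whenever a uniform partitioning scheme can be found.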

142 Parallel Equi-Joins (I). A special class of joins well suited for parallelization are natural and equi-joins: for equi-joins we only look at tuple pairs that share the same join key. Idea: partition the relations R and S using the same partitioning scheme over the join key; then all values of R and S with the same join key end up at the same node, and all joins can be performed locally. The actual implementation depends on how the relations are partitioned: co-located join, directed join, re-partitioning join.

143 Parallel Equi-Joins (II). 1. Both R and S are already partitioned over the join key (and with the same partitioning scheme): co-located join; no re-partitioning is needed. Cost: ??? local join cost. 2. Only one relation is partitioned over the join key: directed join; re-partition the other relation with the same partitioning scheme. Cost (assuming R is already partitioned): B(S) partitioning + B(S) transfer + ??? local join cost. 3. No relation is partitioned over the join key: repartition join; re-partition both relations over the join key. Cost: B(S)+B(R) partitioning + B(S)+B(R) transfer + ??? local join cost.

144 Symmetric Fragment-and-Replicate Join (diagram).

145 Symmetric Fragment-and-Replicate Join (II): diagram of the fragments and replicas across the nodes in the cluster.

146 Asymmetric Fragment-and-Replicate Join. We can do better if relation S is much smaller than R. Idea: reuse the existing partitioning of R and replicate the whole relation S to each node. Cost: p * B(S) transport + ??? local join. The asymmetric fragment-and-replicate join is a special case of the symmetric algorithm with m = p and n = 1; it is also called a broadcast join.


More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Big Data looks Tiny from the Stratosphere

Big Data looks Tiny from the Stratosphere Volker Markl http://www.user.tu-berlin.de/marklv volker.markl@tu-berlin.de Big Data looks Tiny from the Stratosphere Data and analyses are becoming increasingly complex! Size Freshness Format/Media Type

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

Big Data Analytics. Chances and Challenges. Volker Markl

Big Data Analytics. Chances and Challenges. Volker Markl Volker Markl Professor and Chair Database Systems and Information Management (DIMA), Technische Universität Berlin www.dima.tu-berlin.de Big Data Analytics Chances and Challenges Volker Markl DIMA BDOD

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud

Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, JANUARY 2011 1 Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud Daniel Warneke and Odej Kao Abstract In

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

NetFlow Analysis with MapReduce

NetFlow Analysis with MapReduce NetFlow Analysis with MapReduce Wonchul Kang, Yeonhee Lee, Youngseok Lee Chungnam National University {teshi85, yhlee06, lee}@cnu.ac.kr 2010.04.24(Sat) based on "An Internet Traffic Analysis Method with

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Big Graph Processing: Some Background

Big Graph Processing: Some Background Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc. Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010 Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Cost-Effective Business Intelligence with Red Hat and Open Source

Cost-Effective Business Intelligence with Red Hat and Open Source Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,

More information

Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

low-level storage structures e.g. partitions underpinning the warehouse logical table structures DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

White Paper February 2010. IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario

White Paper February 2010. IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario White Paper February 2010 IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario 2 Contents 5 Overview of InfoSphere DataStage 7 Benchmark Scenario Main Workload

More information

Massive scale analytics with Stratosphere using R

Massive scale analytics with Stratosphere using R Massive scale analytics with Stratosphere using R Jose Luis Lopez Pino jllopezpino@gmail.com Database Systems and Information Management Technische Universität Berlin Supervised by Volker Markl Advised

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

Oracle s Big Data solutions. Roger Wullschleger.

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

More information

A Comparison of Join Algorithms for Log Processing in MapReduce

A Comparison of Join Algorithms for Log Processing in MapReduce A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel Computer Sciences Department University of Wisconsin-Madison {sblanas,jignesh}@cs.wisc.edu Vuk Ercegovac,

More information

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCE COMPARING HADOOPDB: A HYBRID OF DBMS AND MAPREDUCE TECHNOLOGIES WITH THE DBMS POSTGRESQL By VANESSA CEDENO A Dissertation submitted to the Department

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Entering the Zettabyte Age Jeffrey Krone

Entering the Zettabyte Age Jeffrey Krone Entering the Zettabyte Age Jeffrey Krone 1 Kilobyte 1,000 bits/byte. 1 megabyte 1,000,000 1 gigabyte 1,000,000,000 1 terabyte 1,000,000,000,000 1 petabyte 1,000,000,000,000,000 1 exabyte 1,000,000,000,000,000,000

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information