Databases 2 (VU)

1 Databases 2 (VU): MapReduce (Part 3)
Mark Kröll, KTI, TU Graz
Nov. 14, 2016

2 Outline
1 Problems Suited for Map-Reduce
  Matrix-Vector Multiplication
  Relational-Algebra Operations
2 Hadoop Ecosystem
  Big Data Storage Technologies
Slides are partially based on Mining Massive Datasets by Jure Leskovec

5 MapReduce: Applications
MapReduce computation makes sense when files are large and rarely updated in place.
It is not suitable for managing online sales (Amazon): the principal operations on Amazon data involve responding to searches for products, recording sales, and so on. These processes involve relatively little calculation, and they change the database.
We won't see MapReduce used for handling Web requests (even with millions of users).
However, MapReduce is a good fit for analytic queries on the data generated by, e.g., a Web application:
  finding users with similar buying patterns
  ranking search results

7 MapReduce: Applications
Computations such as analytic queries typically involve matrix operations.
The original purpose of the MapReduce implementation was to execute the large matrix-vector multiplications needed to calculate PageRank.
Matrix operations such as matrix-matrix and matrix-vector multiplication fit nicely into the MapReduce programming model.
Another important class of operations that can use MapReduce effectively is relational-algebra operations.

8 MapReduce: Applications, Matrix-Vector Multiplication
Suppose we have an $n \times n$ matrix $M$, whose element in row $i$ and column $j$ is denoted $m_{ij}$. Suppose we also have a vector $v$ of length $n$, whose $j$th element is $v_j$. Then the matrix-vector product is the vector $x$ of length $n$, whose $i$th element is given by

$$x_i = \sum_{j=1}^{n} m_{ij} v_j$$

Outline a Map-Reduce program that calculates the vector $x$.

10 Matrix-Vector Multiplication
Let us first assume that the vector $v$ is large but still fits into main memory.
The matrix $M$ and the vector $v$ are each stored in a file of the DFS.
Assume that the row-column coordinates (indices) of each matrix element can be discovered, for example because each value is stored as a triple $(i, j, m_{ij})$.
The position of each $v_j$ can be discovered analogously.

12 Matrix-Vector Multiplication
Map Function:
  The map function applies to one element of the matrix $M$.
  The vector $v$ is first read in its entirety and is available to all Map tasks at that compute node.
  From each matrix element $m_{ij}$ the map function produces the key-value pair $(i, m_{ij} v_j)$.
  All terms of the sum that make up the component $x_i$ of the matrix-vector product get the same key $i$.
Reduce Function:
  The reduce function sums all the values associated with a given key $i$.
  The result is a pair $(i, x_i)$.
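
The Map and Reduce functions above can be sketched in a few lines of Python. This is only an in-memory simulation, not Hadoop code: the function names, the 0-based indices, and the small 2x2 example are assumptions made for illustration, and the shuffle step is imitated with a dictionary.

```python
from collections import defaultdict

def map_matrix_element(element, v):
    """Map: one matrix triple (i, j, m_ij) becomes the pair (i, m_ij * v_j)."""
    i, j, m_ij = element
    return (i, m_ij * v[j])

def reduce_row(i, values):
    """Reduce: sum all partial products that share the key i, giving (i, x_i)."""
    return (i, sum(values))

def matrix_vector_mapreduce(matrix_triples, v):
    # Simulated shuffle: group the mapped values by their key i.
    groups = defaultdict(list)
    for element in matrix_triples:
        key, value = map_matrix_element(element, v)
        groups[key].append(value)
    # One reduce call per key.
    return dict(reduce_row(i, values) for i, values in groups.items())

# Illustrative data (0-based indices): M = [[1, 2], [3, 4]], v = [10, 20]
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [10, 20]
x = matrix_vector_mapreduce(M, v)
print(x)
```

Storing the matrix as triples means zero entries never have to be emitted, which matters for sparse matrices such as the Web link matrix.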

14 Matrix-Vector Multiplication
However, the vector $v$ might not fit into main memory.
It is not required that $v$ fits into the memory of a compute node, but if it does not, there will be a very large number of disk accesses as pieces of the vector are moved into main memory.
Alternatively, we can divide the matrix $M$ into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height.
Use enough stripes so that the portion of the vector in one stripe fits into main memory at a compute node.

15 Matrix-Vector Multiplication
Figure: Divide matrix M and vector v into stripes

16 Matrix-Vector Multiplication
The $i$th stripe of the matrix $M$ multiplies only components from the $i$th stripe of the vector.
We can divide $M$ into one file for each stripe and do the same for $v$.
Each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector.
Map and Reduce tasks can then act exactly as before.
The partial results of the individual stripe multiplications must be summed once more to obtain $x$.
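
A minimal sketch of the striped variant, again as a plain in-memory Python simulation. The helper names, the `width` parameter, and the 0-based indices are illustrative assumptions; the point is that each stripe multiplication only touches its own slice of the vector, and the per-stripe results are summed at the end.

```python
from collections import defaultdict

def split_into_stripes(matrix_triples, v, width):
    """Partition M into vertical stripes and v into matching horizontal stripes."""
    matrix_stripes = defaultdict(list)
    for i, j, m_ij in matrix_triples:
        matrix_stripes[j // width].append((i, j, m_ij))
    stripe_count = (len(v) + width - 1) // width
    vector_stripes = {s: v[s * width:(s + 1) * width] for s in range(stripe_count)}
    return matrix_stripes, vector_stripes

def multiply_stripe(triples, v_stripe, offset):
    """Map and Reduce for one stripe; only v_stripe has to be in memory."""
    partial = defaultdict(int)
    for i, j, m_ij in triples:
        partial[i] += m_ij * v_stripe[j - offset]
    return partial

def striped_matrix_vector(matrix_triples, v, width):
    matrix_stripes, vector_stripes = split_into_stripes(matrix_triples, v, width)
    x = defaultdict(int)
    for s, triples in matrix_stripes.items():
        partial = multiply_stripe(triples, vector_stripes[s], s * width)
        # Final step: sum the partial results of all stripes per row index.
        for i, value in partial.items():
            x[i] += value
    return dict(x)

# Illustrative data: M = [[1, 2], [3, 4]], v = [10, 20], one column per stripe
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [10, 20]
x_striped = striped_matrix_vector(M, v, width=1)
print(x_striped)
```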

18 Relational-Algebra Operations
Many operations on data can be described easily in terms of the common database-query primitives.
The queries themselves need not be executed within a DBMS.
Examples are standard operations on relations, such as selection.
A relation is a table with column headers called attributes.
The set of attributes of a relation R is called its schema: R(A_1, A_2, ..., A_n)

19 Relational-Algebra Operations: Relation Links

From   To
url1   url2
url1   url3
url2   url3
url2   url

Table: The relation consists of the set of pairs of URLs such that the first has one or more links to the second

20 Relational-Algebra Operations: Relation Links
A tuple is a pair of URLs such that there is at least one link from the first to the second URL.
The first row (url1, url2) states that the Web page at url1 points to the Web page at url2.
A similar relation is typically stored by a search engine (with billions of tuples).

21 Relational-Algebra Operations: Relational Algebra
Standard operations on relations are:
1 Selection (σ): apply a condition C to each tuple and output only tuples that satisfy C
2 Projection (π): produce from each tuple only a subset S of attributes
3 Union, Intersection, Difference: set operations on tuples
4 Natural Join (⋈): given two relations, compare each pair of tuples and output those that agree on all common attributes
5 Grouping and Aggregation (γ, θ): partition the tuples in a relation according to their values in a set of attributes; for each group perform one of the operations such as Sum, Count, Avg, Min or Max

22 Relational-Algebra Operations: Example 1, Paths of length 2
Find paths of length 2 in the Web using the Links relation.
In other words, find triples of URLs (u, v, w) such that there is a link from u to v and a link from v to w.
We want to take the natural join of Links with itself.
Let us describe this with two copies of Links: L1(U1, U2) and L2(U2, U3)

23 Relational-Algebra Operations: Example 1, Paths of length 2
Now we compute L1(U1, U2) ⋈ L2(U2, U3).
For each tuple t1 of L1 and each tuple t2 of L2, we check whether their U2 components are the same (these are the second component of t1 and the first component of t2).
If these two components agree, we produce (U1, U2, U3) as a result.
If we only want to check for the existence of a path of length two, we might want to project onto U1 and U3: π_{U1,U3}(L1 ⋈ L2)

25 Relational-Algebra Operations: Example 2, Number of friends
Imagine that a social-networking site has a relation Friends(User, Friend).
Suppose we want to calculate statistics about the number of friends of each user.
In terms of relational algebra we would perform grouping and aggregation: γ_{User, COUNT(Friend)}(Friends)
This operation groups all tuples by the value of the first component and then counts the number of friends.
The result has one tuple for each group; a typical tuple would look like (Sally, 300) if user Sally has 300 friends.

27 Relational-Algebra Operations: Selection by MapReduce
Given a relation R, we want to compute σ_C(R). Selection can be done most conveniently in the map part alone.
Map Function:
  For each tuple t in R, test whether it satisfies C.
  If so, produce the key-value pair (t, t).
Reduce Function:
  The reduce function is the identity.
  It simply passes each key-value pair to the output.
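
As a rough illustration of selection, the following Python sketch simulates the two phases in memory. The function names and the three sample Links tuples are assumptions for the example, not part of any Hadoop API.

```python
from collections import defaultdict

def map_select(t, condition):
    """Map: emit (t, t) only when tuple t satisfies the condition C."""
    return [(t, t)] if condition(t) else []

def reduce_identity(key, values):
    """Reduce: identity; pass every key-value pair through unchanged."""
    return [(key, value) for value in values]

def select(relation, condition):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for t in relation:
        for key, value in map_select(t, condition):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(value for _, value in reduce_identity(key, values))
    return output

# Illustrative Links tuples (From, To)
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3")]
from_url1 = select(links, lambda t: t[0] == "url1")
print(from_url1)
```

Because the reduce step does no work, a real job could typically skip it and write the map output directly.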

29 Relational-Algebra Operations: Projection by Map-Reduce
Given a relation R, we want to compute π_S(R).
Map Function:
  For each tuple t in R, construct a tuple t′ by eliminating from t those components that are not in the projection S.
  Output the key-value pair (t′, t′).
Reduce Function:
  For each key t′ there will be one or more key-value pairs (t′, t′).
  The reduce function turns (t′, [t′, t′, ..., t′]) into (t′, t′).
  The Reduce operation thus performs duplicate elimination.
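
A small in-memory sketch of projection with duplicate elimination. The names `map_project`, `reduce_dedup`, the index list `keep`, and the sample tuples are illustrative assumptions.

```python
from collections import defaultdict

def map_project(t, keep):
    """Map: drop all components not listed in keep, emit (t', t')."""
    t_prime = tuple(t[k] for k in keep)
    return (t_prime, t_prime)

def reduce_dedup(t_prime, values):
    """Reduce: (t', [t', t', ..., t']) collapses to a single (t', t')."""
    return (t_prime, t_prime)

def project(relation, keep):
    groups = defaultdict(list)  # simulated shuffle: group values by key
    for t in relation:
        key, value = map_project(t, keep)
        groups[key].append(value)
    return [reduce_dedup(key, values)[1] for key, values in groups.items()]

# Illustrative Links tuples (From, To); project onto the From attribute
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3")]
sources = project(links, keep=[0])
print(sources)
```

The two tuples starting with url1 collapse into one output tuple, which is exactly the duplicate elimination described above.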

31 Relational-Algebra Operations: Natural Join by Map-Reduce
Given relations R(A, B) and S(B, C), we want to compute R ⋈ S. We must find tuples that agree on their B components.
Map Function:
  For each tuple (a, b) of R, produce the key-value pair (b, (R, a)).
  For each tuple (b, c) of S, produce the key-value pair (b, (S, c)).
Reduce Function:
  Each key b will be associated with a list of pairs that are either of the form (R, a) or (S, c).
  Construct all triples (a, b, c) that combine an a from an (R, a) pair with a c from an (S, c) pair.
The challenge is to convert one's task into a form that can be processed by MapReduce, so that it adheres to its internal key/value structure.
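
The join can be sketched as follows, reusing the idea of joining Links with itself to find paths of length 2. This is an in-memory simulation under assumed names (`map_r`, `map_s`, `reduce_join`) with three made-up tuples.

```python
from collections import defaultdict

def map_r(t):
    """Map for R(A, B): (a, b) becomes (b, ("R", a))."""
    a, b = t
    return (b, ("R", a))

def map_s(t):
    """Map for S(B, C): (b, c) becomes (b, ("S", c))."""
    b, c = t
    return (b, ("S", c))

def reduce_join(b, values):
    """Reduce: pair every R-side value with every S-side value for key b."""
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    return [(a, b, c) for a in r_side for c in s_side]

def natural_join(r, s):
    groups = defaultdict(list)  # simulated shuffle: group values by key b
    for key, value in [map_r(t) for t in r] + [map_s(t) for t in s]:
        groups[key].append(value)
    result = []
    for b, values in groups.items():
        result.extend(reduce_join(b, values))
    return result

# Illustrative Links tuples; joining Links with itself yields paths of length 2
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3")]
paths = natural_join(links, links)
print(paths)
```

The tags "R" and "S" are needed because after the shuffle the reducer only sees a mixed list of values and must know which relation each one came from.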

33 Relational-Algebra Operations: Grouping and Aggregation by Map-Reduce
Given a relation R(A, B, C), we want to compute γ_{A,θ(B)}(R).
Map Function:
  Map produces the grouping.
  For each tuple (a, b, c), produce the key-value pair (a, b).
Reduce Function:
  The reduce function produces the aggregation.
  Each key a represents a group.
  Apply the aggregation operator θ to the list [b_1, b_2, ..., b_n] of the values associated with a.
  The output is a pair (a, x), where x is the result of θ applied to the list.
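
A minimal sketch of grouping and aggregation, applied to a Friends(User, Friend) relation with COUNT as the operator θ. The function names, the index parameters, and the sample tuples are assumptions for illustration; Python's built-in `len` stands in for COUNT.

```python
from collections import defaultdict

def map_group(t, key_index, value_index):
    """Map: emit (a, b), the grouping attribute and the attribute to aggregate."""
    return (t[key_index], t[value_index])

def reduce_aggregate(a, values, theta):
    """Reduce: apply the aggregation operator theta to the list of values."""
    return (a, theta(values))

def group_and_aggregate(relation, key_index, value_index, theta):
    groups = defaultdict(list)  # simulated shuffle: one list of b values per a
    for t in relation:
        a, b = map_group(t, key_index, value_index)
        groups[a].append(b)
    return dict(reduce_aggregate(a, bs, theta) for a, bs in groups.items())

# Illustrative Friends(User, Friend) tuples
friends = [("Sally", "Bob"), ("Sally", "Ann"), ("Tom", "Sally")]
counts = group_and_aggregate(friends, 0, 1, len)
print(counts)
```

Passing `sum`, `min`, or `max` instead of `len` gives the other aggregation operators, provided the aggregated attribute is numeric or otherwise comparable.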

37 Hadoop Ecosystem: Hadoop Eco System (v1)
HBase
  open source, non-relational, distributed database modeled after Google's BigTable and written in Java
  provides a fault-tolerant way of storing large quantities of sparse data
Hive
  data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
  a data warehouse is a system used for reporting and data analysis
Pig
  high-level platform for creating programs that run on Apache Hadoop (the language is called Pig Latin)
  abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level

41 Hadoop Ecosystem: Hadoop Eco System (v2)
Yarn (Yet Another Resource Negotiator)
  is a cluster management system for running Big Data applications on a cluster
  is not a data processing platform itself, but enables the platforms to run their code in a cluster environment
Spark, Flink, Storm
  are frameworks for cluster computing specializing in either batch processing, stream processing or both (see next slide)
Giraph
  utilizes Apache Hadoop's MapReduce implementation to process graphs
Impala
  is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop

42 Hadoop Ecosystem: Data Processing [figure]

43 Hadoop Ecosystem: Java(-ish) is the Hadoop language

44 Hadoop Ecosystem: The good, the bad and the ugly
The Good
  Easy to write parallel, highly scalable applications
  Stream and Batch processing
  Seamless integration with other systems (e.g. RDBMS)
The Bad
  Rapid development, hard to keep an overview
  150 projects in (or near) the Hadoop Eco System
The Ugly
  Maven dependency hell if integrated with other systems
  Spark depends on > 50 libraries with a specific version!

45 Hadoop Ecosystem: History of Hadoop [figure]

46 Hadoop Ecosystem: System Design [figure]

50 Hadoop Ecosystem: Big Data Storage Technologies
File-based: HDFS
  distributed, permanent file storage
  tuned for large files
  no indexing
Key-Value based: HBase
  distributed Key/Value store
  fast look-ups
  based on HDFS
Message-based: Kafka
  distributed Producer/Consumer messaging system
  data partitioned in topics
  producer groups / consumer groups

51 Big Data Storage Technologies: HDFS [figure]

53 Big Data Storage Technologies: HBase [figure]

55 Big Data Storage Technologies: Kafka [figure]

59 To Sum Up:
Part 1: handling big data
  key elements: MapReduce, distributed file system
Part 2: maximizing parallelism
  input data skew
Part 3: applications
  Hadoop ecosystem

60 The End
Next: Graph databases, Nov. 28th


More information

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics

More information

Big Data and Industrial Internet

Big Data and Industrial Internet Big Data and Industrial Internet Keijo Heljanko Department of Computer Science and Helsinki Institute for Information Technology HIIT School of Science, Aalto University keijo.heljanko@aalto.fi 16.6-2015

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements

More information

HPC ABDS: The Case for an Integrating Apache Big Data Stack

HPC ABDS: The Case for an Integrating Apache Big Data Stack HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org

More information

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1

Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1 Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010

More information

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *

More information

Hadoop in the Enterprise

Hadoop in the Enterprise Hadoop in the Enterprise Modern Architecture with Hadoop 2 Jeff Markham Technical Director, APAC Hortonworks Hadoop Wave ONE: Web-scale Batch Apps relative % customers 2006 to 2012 Web-Scale Batch Applications

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information
