Databases 2 (VU) ( )
|
|
- Jessica Sharp
- 7 years ago
- Views:
Transcription
1 Databases 2 (VU) ( ) MapReduce (Part 3) Mark Kröll KTI, TU Graz Nov. 14, 2016 Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
2 Outline 1 Problems Suited for Map-Reduce Matrix-Vector Multiplication Relational-Algebra Operations 2 Hadoop Ecosystem Big Data Storage Technologies Slides are partially based on Mining Massive Datasets by Jure Leskovec Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
3 MapReduce: Applications MapReduce computation makes sense when files are large and rarely updated in place Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
4 MapReduce: Applications MapReduce computation makes sense when files are large and rarely updated in place not suitable when managing online sales (Amazon) the principal operations on Amazon data involve responding to searches for products, recording sales, and so on, processes that involve relatively little calculation and that change the database won t see MapReduce for handling Web requests (even if we have millions of users) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
5 MapReduce: Applications MapReduce computation makes sense when files are large and rarely updated in place not suitable when managing online sales (Amazon) the principal operations on Amazon data involve responding to searches for products, recording sales, and so on, processes that involve relatively little calculation and that change the database won t see MapReduce for handling Web requests (even if we have millions of users) however, you want to use MapReduce for analytic queries on the data generated by an e.g. Web application find users with similar buying patterns ranking search results Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
6 MapReduce: Applications computations such as analytic queries typically involve matrix operations original purpose for the MapReduce implementation was to execute large matrix-vector multiplications to calculate the PageRank Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
7 MapReduce: Applications computations such as analytic queries typically involve matrix operations original purpose for the MapReduce implementation was to execute large matrix-vector multiplications to calculate the PageRank matrix operations such as matrix-matrix and matrix-vector multiplications fit nicely into MapReduce programming model another important class of operations that can use MapReduce effectively are relational-algebra operations Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
8 MapReduce: Applications Matrix-Vector Multiplication Suppose we have an n n matrix M, whose element in row i and column j will be denoted m ij. Suppose we also have a vector v of length n, whose jth element is v j. Then the matrix-vector product is the vector x of length n, whose ith element is given by x i = n m ij v j j=1 Outline a Map-Reduce program that calculates the vector x. Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
9 Matrix-Vector Multiplication Matrix-Vector Multiplication Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
10 Matrix-Vector Multiplication Matrix-Vector Multiplication let us first assume that the vector v is large, but it still can fit into the memory the matrix M and the vector v will be each stored in a file of the DFS assume that the row-column coordinates of a matrix element (indices) can be discovered for example, each value is stored as a triple (i, j, m ij ) similarly, the position of v j can be discovered analogously Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
11 Matrix-Vector Multiplication Matrix-Vector Multiplication Map Function: the map function applies to one element of the matrix M the vector v is first read in its entirety and is available for all Map tasks at that compute node from each matrix element m ij the map function produces the key-value pair (i, m ij v j ) all terms of the sum that make up the component x i of the matrix-vector product will get the same key i Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
12 Matrix-Vector Multiplication Matrix-Vector Multiplication Map Function: the map function applies to one element of the matrix M the vector v is first read in its entirety and is available for all Map tasks at that compute node from each matrix element m ij the map function produces the key-value pair (i, m ij v j ) all terms of the sum that make up the component x i of the matrix-vector product will get the same key i Reduce Function: reduce function sums all the values associated with a given key i result is a pair (i, x i ) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
13 Matrix-Vector Multiplication Matrix-Vector Multiplication however, it might be that the vector v does not fit into main memory it is not required that the vector v fits into the memory at a compute node, but if it does not there will be a very large number of disk accesses as we move pieces of the vector into main memory Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
14 Matrix-Vector Multiplication Matrix-Vector Multiplication however, it might be that the vector v does not fit into main memory it is not required that the vector v fits into the memory at a compute node, but if it does not there will be a very large number of disk accesses as we move pieces of the vector into main memory alternatively we can divide the matrix M into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height use enough stripes so that the portion of the vector in one stripe can fit into main memory at a compute node Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
15 Matrix-Vector Multiplication Matrix-Vector Multiplication Figure: Divide matrix M and vector v into stripes Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
16 Matrix-Vector Multiplication Matrix-Vector Multiplication the ith stripe of matrix M multiplies only components from the ith stripe of the vector can divide matrix M into one file for each stripe, and do the same for the vector v each Map task is assigned a chunk from one of the stripes in the matrix and gets the entire corresponding stripe of the vector Map and Reduce tasks can then act exactly as before need to sum up once more the results of the stripes multiplication Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
17 Relational-Algebra Operations Relational-Algebra Operations many operation on data can be described easily in terms of the common database-query primitives the queries themselves must not be executed within a DBMS e.g. standard operations on relations such as selection Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
18 Relational-Algebra Operations Relational-Algebra Operations many operation on data can be described easily in terms of the common database-query primitives the queries themselves must not be executed within a DBMS e.g. standard operations on relations such as selection a relation is a table with column headers called attributes the set of attributes of a relation R is called its schema: R(A 1, A 2,..., A n ) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
19 Relational-Algebra Operations Relation Links From To url1 url2 url1 url3 url2 url3 url2 url Table: The relation consists of the set of pairs of URL s, such that the first has one or more links to the second Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
20 Relational-Algebra Operations Relation Links a tuple is a pair of URLs such that there is at least one link from the first to the second URL the first row (url1, url2) states that the Web page at url1 points to the Web page at url2 a similar relation is typically stored by a search engine (with billions of tuples) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
21 Relational-Algebra Operations Relational-Algebra Standard operation on relations are 1 Selection(σ): apply a condition C to each tuple and output only tuples that satisfy C 2 Projection (π): produce from each tuple only a subset S of attributes 3 Union, Intersection, Difference: set operations on tuples 4 Natural Join ( ): Given two relations compare each pair of tuples and output those that agree on all common attributes 5 Grouping and Aggregation (γ, θ): partition the tuples in a relation according to their values in a set of attributes. For each group perform one of the operations such as Sum, Count, Avg, Min or Max Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
22 Relational-Algebra Operations Example 1: Paths of length 2 find paths of length 2 in the Web using the Links relation in other words find triples of URLs (u, v, w) such that there is a link between u and v and a link between v and w we want to take natural join of Links with itself let us describe this with two copies of Links: L1(U1, U2) and L2(U2, U3) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
23 Relational-Algebra Operations Example 1: Paths of length 2 now we compute L1(U1, U2) L2(U2, U3) for each tuple t1 of L1 and each tuple t2 of L2, we see if their U2 components are same these components are the second component of t1 and the first component of t2) if these two components agree, we produce (U1, U2, U3) as a result if we want only to check for the existence of the path of length two we might want to project onto U1 and U3 π U1,U3 (L1 L2) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
24 Relational-Algebra Operations Example 2: Number of friends imagine that a social-networking site has a relation Friends(User, Friend) suppose we want to calculate the statistics about the number of friends of each user in terms of relational algebra we would perform grouping and aggregation: γ User,COUNT (Friend) (Friends) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
25 Relational-Algebra Operations Example 2: Number of friends imagine that a social-networking site has a relation Friends(User, Friend) suppose we want to calculate the statistics about the number of friends of each user in terms of relational algebra we would perform grouping and aggregation: γ User,COUNT (Friend) (Friends) this operation groups all tuples by the value of the first component and then counts the number of friends one tuple for each group, and a typical tuple would look like (Sally, 300), if user Sally has 300 friends Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
26 Relational-Algebra Operations Selection by MapReduce Selection given is a relation R; we want to compute σ C (R); can be done most conveniently in the map part alone Map Function: for each tuple t in R, test if it satisfies C if so produce the key value pair (t, t) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
27 Relational-Algebra Operations Selection by MapReduce Selection given is a relation R; we want to compute σ C (R); can be done most conveniently in the map part alone Map Function: for each tuple t in R, test if it satisfies C if so produce the key value pair (t, t) Reduce Function: the reduce function is identity it simply passes each key-value pair to the output Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
28 Relational-Algebra Operations Projection by Map-Reduce Projection given is a relation R; we want to compute π S (R) Map Function: for each tuple t in R construct a tuple t by eliminating from t those components that are not in projection S output the key-value pair (t, t ) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
29 Relational-Algebra Operations Projection by Map-Reduce Projection given is a relation R; we want to compute π S (R) Map Function: for each tuple t in R construct a tuple t by eliminating from t those components that are not in projection S output the key-value pair (t, t ) Reduce Function: for each key t there will be one or more key-value pairs (t, t ) the reduce function turns (t, [t, t,..., t ]) into (t, t ) the Reduce operation equals a duplicate elimination Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
30 Relational-Algebra Operations Natural-Join by Map-Reduce Natural-Join given are relations R(A, B) and S(B, C); we want to compute R S must find tuples that agree on their B components Map Function: for each tuple (a, b) of R produce the key-value pair (b, (R, a)) for each tuple (b, c) of S produce the key-value pair (b, (S, c)) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
31 Relational-Algebra Operations Natural-Join by Map-Reduce Natural-Join given are relations R(A, B) and S(B, C); we want to compute R S must find tuples that agree on their B components Map Function: for each tuple (a, b) of R produce the key-value pair (b, (R, a)) for each tuple (b, c) of S produce the key-value pair (b, (S, c)) Reduce Function: each key b will be associated with a list of pairs that are either of the form (R, a) or (S, c) construct all pairs consisting of the values (a, b, c) the challenge is to convert one s task in a way that it can be processed by MapReduce; so that it adheres to its internal key/value structure Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
32 Relational-Algebra Operations Grouping and Aggregation by Map-Reduce Grouping and Aggregation given is a relation R(A, B, C); we want to compute γ A,θ(B) (R) Map Function: Map produces the grouping for each tuple (a, b, c) produce the key-value pair (a, b) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
33 Relational-Algebra Operations Grouping and Aggregation by Map-Reduce Grouping and Aggregation given is a relation R(A, B, C); we want to compute γ A,θ(B) (R) Map Function: Map produces the grouping for each tuple (a, b, c) produce the key-value pair (a, b) Reduce Function: reduce function produces the aggregation each key a represents a group apply the aggregation operator θ to the list [b 1, b 2,..., b n ] of the values associated with a output is a pair (a, x), where x is the result of θ applied to the list Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
34 Hadoop Ecosystem Hadoop Eco System (v1) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
35 Hadoop Ecosystem Hadoop Eco System (v1) HBase open source, non-relational, distributed database modeled after Google s BigTable and is written in Java provides a fault-tolerant way of storing large quantities of sparse data Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
36 Hadoop Ecosystem Hadoop Eco System (v1) HBase Hive open source, non-relational, distributed database modeled after Google s BigTable and is written in Java provides a fault-tolerant way of storing large quantities of sparse data data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis a data warehouse is a system used for reporting and data analysis Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
37 Hadoop Ecosystem Hadoop Eco System (v1) HBase Hive Pig open source, non-relational, distributed database modeled after Google s BigTable and is written in Java provides a fault-tolerant way of storing large quantities of sparse data data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis a data warehouse is a system used for reporting and data analysis high-level platform for creating programs that run on Apache Hadoop (language is called Pig Latin) abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
38 Hadoop Ecosystem Hadoop Eco System (v2) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
39 Hadoop Ecosystem Hadoop Eco System (v2) Yarn (Yet Another Resource Negotiator) is a cluster management system to run Big Data applications on a cluster data; not a data processing platform itself but enables the platforms to run their code in a cluster environment Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
40 Hadoop Ecosystem Hadoop Eco System (v2) Yarn (Yet Another Resource Negotiator) is a cluster management system to run Big Data applications on a cluster data; not a data processing platform itself but enables the platforms to run their code in a cluster environment Spark, Flink, Storm are frameworks for cluster computing specializing in either batch processing, stream processing or both (see next slide) Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
41 Hadoop Ecosystem Hadoop Eco System (v2) Yarn (Yet Another Resource Negotiator) is a cluster management system to run Big Data applications on a cluster data; not a data processing platform itself but enables the platforms to run their code in a cluster environment Spark, Flink, Storm Giraph Impala are frameworks for cluster computing specializing in either batch processing, stream processing or both (see next slide) utilizes Apache Hadoop s MapReduce implementation to process graphs is Cloudera s open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
42 Hadoop Ecosystem Data Processing Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
43 Hadoop Ecosystem Java(-ish) is the Hadoop language Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
44 Hadoop Ecosystem The good, the bad and the ugly The Good Easy to write parallel, highly scalable applications Stream and Batch processing Seamless integration with other systems (e.g. RDBMS) The Bad Rapid development, hard to keep overview 150 projects in (or near) Hadoop Eco System1 The Ugly Maven dependency hell if integrated with other systems Spark depends on > 50 libraries with a specific version! Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
45 Hadoop Ecosystem History of Hadoop Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
46 Hadoop Ecosystem System Design Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
47 Hadoop Ecosystem System Design Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
48 Hadoop Ecosystem Big Data Storage Technologies Big Data Storage Technologies File-based: HDFS distributed, permanent file storage tuned for large files no indexing Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
49 Hadoop Ecosystem Big Data Storage Technologies Big Data Storage Technologies File-based: HDFS distributed, permanent file storage tuned for large files no indexing Key-Value based: HBase distributed Key/Value store fast look-ups based on HDFS Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
50 Hadoop Ecosystem Big Data Storage Technologies Big Data Storage Technologies File-based: HDFS distributed, permanent file storage tuned for large files no indexing Key-Value based: HBase distributed Key/Value store fast look-ups based on HDFS Message-based: Kafka distributed Producer/Consumer messaging system data partitioned in topics producer groups / consumer groups Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
51 Hadoop Ecosystem Big Data Storage Technologies HDFS Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
52 Hadoop Ecosystem Big Data Storage Technologies HDFS Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
53 Hadoop Ecosystem Big Data Storage Technologies HBASE Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
54 Hadoop Ecosystem Big Data Storage Technologies HBASE Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
55 Hadoop Ecosystem Big Data Storage Technologies kafka Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
56 Hadoop Ecosystem Big Data Storage Technologies kafka Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
57 Hadoop Ecosystem Big Data Storage Technologies To Sum Up: Part 1: handling big data key elements: MapReduce, distributed file system Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
58 Hadoop Ecosystem Big Data Storage Technologies To Sum Up: Part 1: handling big data key elements: MapReduce, distributed file system Part 2: maximizing parallelism input data skew Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
59 Hadoop Ecosystem Big Data Storage Technologies To Sum Up: Part 1: handling big data key elements: MapReduce, distributed file system Part 2: maximizing parallelism input data skew Part 3: applications hadoop ecosystem Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
60 Hadoop Ecosystem Big Data Storage Technologies The End Next: Graph databases, Nov.28th Mark Kröll (KTI, TU Graz) MapReduce Nov. 14, / 41
MapReduce and the New Software Stack
20 Chapter 2 MapReduce and the New Software Stack Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. In many of these applications, the
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More informationIntroduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationApache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationReal Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationHadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter : @carbone
Hadoop2, Spark Big Data, real time, machine learning & use cases Cédric Carbone Twitter : @carbone Agenda Map Reduce Hadoop v1 limits Hadoop v2 and YARN Apache Spark Streaming : Spark vs Storm Machine
More informationHow Companies are! Using Spark
How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made
More informationOracle Big Data Fundamentals Ed 1 NEW
Oracle University Contact Us: +90 212 329 6779 Oracle Big Data Fundamentals Ed 1 NEW Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big
More informationLecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationInfrastructures for big data
Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)
More informationUnified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationHadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationCloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
More informationArchitectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationSystems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
More informationSQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationNative Connectivity to Big Data Sources in MSTR 10
Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationBig Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
More informationA Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationbrief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385
brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and
More informationBig Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig
Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types
More informationNative Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy
Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics
More informationBig Data and Industrial Internet
Big Data and Industrial Internet Keijo Heljanko Department of Computer Science and Helsinki Institute for Information Technology HIIT School of Science, Aalto University keijo.heljanko@aalto.fi 16.6-2015
More informationBig Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements
More informationHPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org
More informationSpark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
More informationGraph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
More informationHadoop in the Enterprise
Hadoop in the Enterprise Modern Architecture with Hadoop 2 Jeff Markham Technical Director, APAC Hortonworks Hadoop Wave ONE: Web-scale Batch Apps relative % customers 2006 to 2012 Web-Scale Batch Applications
More informationBig Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
More informationSpark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationHadoop in Social Network Analysis - overview on tools and some best practices - Headline Goes Here
Hadoop in Social Network Analysis - overview on tools and some best practices - Headline Goes Here Speaker Name or Subhead Goes Here GridKa School 2013, Karlsruhe 2013-08-27 Mirko Kämpf mirko@cloudera.com
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationApache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
More informationXiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
More informationArchitectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationBuilding Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
More informationThe Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationThe Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang
The Big Data Ecosystem at LinkedIn Presented by Zhongfang Zhuang Based on the paper The Big Data Ecosystem at LinkedIn, written by Roshan Sumbaly, Jay Kreps, and Sam Shah. The Ecosystems Hadoop Ecosystem
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationCloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationData Services Advisory
Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains
More informationBig Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
More informationHadoop-BAM and SeqPig
Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationUsing distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationextensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
More informationSession 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges
Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges James Campbell Corporate Systems Engineer HP Vertica jcampbell@vertica.com Big
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationBig Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationBig Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce
Big Data and Hadoop Module 1: Introduction to Big Data and Hadoop Learn about Big Data and the shortcomings of the prevailing solutions for Big Data issues. You will also get to know, how Hadoop eradicates
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationPro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationDatabases 2 (VU) (707.030)
Databases 2 (VU) (707.030) Introduction to NoSQL Denis Helic KMI, TU Graz Oct 14, 2013 Denis Helic (KMI, TU Graz) NoSQL Oct 14, 2013 1 / 37 Outline 1 NoSQL Motivation 2 NoSQL Systems 3 NoSQL Examples 4
More informationGraph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
More informationBig Data Technology CS 236620, Technion, Spring 2013
Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Roadmap Previous class MR Implementation This class Query Languages
More informationthe missing log collector Treasure Data, Inc. Muga Nishizawa
the missing log collector Treasure Data, Inc. Muga Nishizawa Muga Nishizawa (@muga_nishizawa) Chief Software Architect, Treasure Data Treasure Data Overview Founded to deliver big data analytics in days
More informationApache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis
Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015 Outline Who am
More informationIntegrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
More informationI/O Considerations in Big Data Analytics
Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationHadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
More informationUnified Big Data Analytics Pipeline. 连 城 lian@databricks.com
Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
More informationAli Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
More informationDell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Hadoop Overview, Customer Evolution and Dell In-Memory Product Details Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/
More informationIntroduction to Spark
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://
More informationLambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More information