Infrastructures for big data. Rasmus Pagh.
Today's lecture
Three technologies for handling big data:
- MapReduce (Hadoop)
- BigTable (and descendants)
- Data stream algorithms
Alternatives to (some uses of) DBMSs. Part of a larger trend: "Not Only SQL".
NoSQL
It is silly to identify technologies with what they are not. Better: Not Only SQL. But what is it? Lemire: a programmer's revolt against database administrators. A common reason: independence from very expensive, large DBMSs.
MapReduce
Google's system for distributed queries on line-based data. Runs on a cluster of networked machines (can be 1000s). Open source version: Hadoop. Builds on a distributed file system. Does not deal with transactions.
MapReduce programming model
1. For each input line, Map to zero, one, or more (key,value) pairs.
2. Group the pairs by key.
3. For each key, Reduce the list of associated values to a result.
MapReduce execution (diagram)
MapReduce examples
1. Word count. Mapper: transform text lines into pairs (w,1). Reducer: add up the occurrences of each word.
2. SQL aggregation. Can implement queries of the form:
SELECT myreducer FROM mymapper(r) GROUP BY key
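The word-count example can be sketched as a minimal single-process simulation of the map, shuffle, and reduce phases. This is illustrative only, not Hadoop API code; the names mapper, reducer, and run are my own.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word on the input line.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: add up the occurrences of one word.
    return (word, sum(counts))

def run(lines):
    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())
```

In a real MapReduce job the shuffle is done by the framework across machines; here it is a single dictionary.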
Example: Inverted lists
Example: Listing triangles
Example: the Facebook friend graph, given as a list of all pairs {u,v} where u and v are friends. Triangle: three mutual friends.
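What triangle listing computes can be shown by a serial sketch (my own code, assuming the graph arrives as a list of undirected friend pairs). A MapReduce version would distribute the same neighbour-pair checks over reducers.

```python
from itertools import combinations

def triangles(edges):
    # Build adjacency sets from the list of undirected friend pairs.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # For each node, test every pair of its neighbours for an edge.
    found = set()
    for u, nbrs in adj.items():
        for v, w in combinations(sorted(nbrs), 2):
            if w in adj[v]:
                found.add(tuple(sorted((u, v, w))))
    return found
```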
Problem session
Suppose you want to compute a join of R1(a,b) and R2(b,c) on attribute b.
1. How could the input be represented to fit the MapReduce framework?
2. How can the join be computed? Specify a mapper. Specify a reducer.
BigTable
Google's system for storing and accessing data persistently in a distributed system. Highly scalable on clusters of cheap machines: add machines to scale up. Highly fault tolerant (replication). Like other distributed storage systems, it offers relaxed consistency compared to a DBMS ("eventual consistency").
BigTable in a nutshell
Small subset of DBMS functionality ("meet 7 out of 8 demands"). Data model generalizes the relational one: (column:string, rowid:string, time:int) -> string. Stored sorted lexicographically by key. Only simple queries and transactions: look up a string using rowid and column; a transaction modifies a single row.
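The data model above can be illustrated with a toy in-memory sketch (my own illustration, not the real system): cells are keyed by (row, column, timestamp), keys are kept sorted, and a lookup returns the newest value for a (row, column) pair.

```python
import bisect

class TinyBigTable:
    """Toy model of BigTable's (row, column, timestamp) -> value map."""
    def __init__(self):
        self.keys = []    # sorted list of (row, column, -timestamp)
        self.cells = {}

    def put(self, row, column, value, timestamp):
        # Negate the timestamp so the newest version sorts first.
        key = (row, column, -timestamp)
        if key not in self.cells:
            bisect.insort(self.keys, key)
        self.cells[key] = value

    def lookup(self, row, column):
        # First key >= (row, column, -inf) is the newest version, if any.
        i = bisect.bisect_left(self.keys, (row, column, float("-inf")))
        if i < len(self.keys) and self.keys[i][:2] == (row, column):
            return self.cells[self.keys[i]]
        return None
```

The real system shards this sorted key space into tablets spread over many machines; the toy keeps only the key structure.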
More BigTable
Many similar systems have followed (distributed hash tables, Cassandra, DynamoDB, HBase, ...). The course page links to a nice presentation by Jeff Dean, one of the system's main engineers. The first 17 minutes give a good overview; the rest is technical detail outside the scope of this class.
Data streams
In some applications (networks, sensors) data is produced so fast that normal techniques cannot keep up. Data model: a stream of data items, e.g. tuples, numbers, graph edges. Processing model: one pass over the data (cannot go back); memory only large enough to store a tiny part of the data. Instead, we store a sketch or summary that encodes enough information.
Things easy on a stream
- Count the number of data items.
- Compute aggregates of numbers: sum, average, maximum, minimum, top-k, variance.
- Select tuples satisfying a condition.
- Select a sample containing 1% of the data items.
- Split into several streams.
More?
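Several of the aggregates above can be maintained in one pass with constant memory. A sketch (my own code; the variance update is Welford's method):

```python
class StreamStats:
    """One-pass, constant-memory count, mean, max, and variance."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                  # running sum of squared deviations
        self.maximum = float("-inf")

    def push(self, x):
        self.n += 1
        self.maximum = max(self.maximum, x)
        # Welford's update: incorporate x into mean and m2.
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of the items seen so far.
        return self.m2 / self.n if self.n else 0.0
```

Note that each item is touched once and then discarded, exactly as the processing model requires.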
Primitive: Heavy hitters
Find the frequency of each data item, with error at most a fraction f of the stream length; items with frequency below f may be reported as zero. Space usage is around 1/f counters. Extension: allow a weight for each item. Example: the stream consists of (country, amount) pairs, and we want to report the countries that account for at least a fraction f of the total amount.
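One classic heavy-hitters algorithm with these guarantees is Misra-Gries (the slide does not name a specific algorithm; this is one standard choice). With k-1 counters, each reported count undercounts the true frequency by at most n/k:

```python
def misra_gries(stream, k):
    """Heavy hitters with at most k-1 counters.
    Every item occurring more than n/k times survives in the output."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement all counters; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A decrement step removes one occurrence of k distinct items at once (x and the k-1 counters), which is why the undercount is bounded by n/k.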
Primitive: Distinct elements
How many distinct elements does the stream contain? The answer should be correct within k% error. Amazing: the space usage does not depend on the length of the stream! Example: estimating result sizes in a DBMS.
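A sketch in the style of Flajolet-Martin counting (one standard way to get this guarantee; the slide does not fix an algorithm). Each of the num_sketches counters remembers only the maximum number of trailing zero bits seen in a hash, so memory is independent of the stream length:

```python
import hashlib

def distinct_estimate(stream, num_sketches=32):
    """Flajolet-Martin style estimate of the number of distinct items."""
    max_zeros = [0] * num_sketches
    for item in stream:
        for i in range(num_sketches):
            h = int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16)
            # Lowest set bit position = number of trailing zero bits.
            tz = (h & -h).bit_length() - 1
            if tz > max_zeros[i]:
                max_zeros[i] = tz
    # Average the sketches, then convert to a cardinality estimate.
    mean = sum(max_zeros) / num_sketches
    return 2.0 ** mean / 0.77351   # classic FM correction constant
```

Duplicates hash to the same values and so never change the sketch, which is exactly why only distinct elements matter. Averaging more sketches reduces the relative error.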
Primitive: Euclidean distance
Given two data streams of numbers, how similar are they? View each stream as a point in n-dimensional space. We want to know the distance between the points, again with k% error allowed. Possible in space that depends only on the precision k and log(n). For large n, log n is a constant in practice.
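One way to achieve this (my own sketch; the slide names no algorithm) is a random ±1 projection in the AMS style. I assume each stream is a sequence of (coordinate, value) updates; both streams must be sketched with the same random matrix, and the sketch has only d numbers regardless of n:

```python
import random

def make_sketcher(n, d, seed=0):
    """Shared random +/-1 projection with d rows and n columns."""
    rng = random.Random(seed)
    rows = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(d)]

    def sketch(stream):
        # stream yields (coordinate index, value) updates.
        s = [0.0] * d
        for i, v in stream:
            for r in range(d):
                s[r] += rows[r][i] * v
        return s

    return sketch

def estimate_distance(s1, s2):
    # Average squared difference over the d rows estimates ||x - y||^2.
    d = len(s1)
    return (sum((a - b) ** 2 for a, b in zip(s1, s2)) / d) ** 0.5
```

Larger d gives a smaller relative error, matching the slide's claim that space depends on the precision, not on the stream length.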
Data stream differences
Many data stream algorithms let you look at the difference of two streams. Examples: Which items have significantly higher frequency in s1 than in s2? If each item occurs only once in a stream: how many items were in s1 but not in s2? Application: anomaly detection.
Putting primitives together
Can build data stream algorithms by combining or pipelining primitives. Examples: the number of distinct customers per shop (split stream, distinct elements); the average number of times an item occurs (count total length, distinct elements); countries where sales have increased by more than 1 million compared to last month (store last month's sketch, take the difference).
Where are the stream systems?
Prototype systems such as the Stanford Stream Data Manager (CQL). Gigascope: a system used at AT&T. Real-time databases may work on data streams, but are typically update-optimized DBMSs.
Course goals
After the course the students should be able to: formulate an analytics task as a sequence of operations on a stream, or in the MapReduce framework.
Next steps
This afternoon: last exercise session, with an exam-type exercise. Midnight: deadline for hand-in 4 and re-submission of hand-in 3. Next week: I encourage you to sit down and do problems 1, 3, 4, and 5 of the exam from January 2012; this should take (at most) about 3 hours. At the lecture I will go through the exam and how it is graded.