Big Data Systems CS 5965/6965 FALL 2015




Today: general course overview; expectations from this course; Q&A; introduction to Big Data; Assignment #1

General Course Information. Course web page: http://www.cs.utah.edu/~hari/teaching/fall2015.html (also on Canvas). Text & references: Mining of Massive Datasets (Leskovec, Rajaraman, Ullman); we will also use online resources and papers. CHPC accounts: fill this form asap: http://goo.gl/forms/nt8tkftijq. TA: Vairavan Sivaraman. Email: vairavan.sivaraman@utah.edu. Office hours

Class Interaction I strongly encourage discussions & interactions 10% for class interaction and online participation

Academic conduct Adherence to the CoE and SoC academic guidelines is expected. Please read the following. College of Engineering Academic Guidelines from their web page School of Computing Misconduct Policy Please Read TL;DR NO CHEATING

Exams, Assignments & Projects. Assignment 0: due in 1 week. Assignments 1-2: one every week, simple Spark problems. Assignments 3, 4: larger problems, 2 weeks each. Final project: submit a proposal before the start of fall break; default project; 2-3 students per group. Mid-term exam: after fall break.

Things you should know. No formal prerequisite. Good programming skills (any language): Python/Scala/Java; make a case if you would like to use any other language. Be prepared to learn new programming tools & techniques. Linear algebra. Sequential algorithms, complexity analysis.

What will we cover? Hadoop & Spark MapReduce Parallel Algorithms Searching & Sorting in Parallel Randomized Algorithms - Graph Coloring PageRank Clustering Data Recommendation Systems Matrix Factorization Social Networks Graph Algorithms Large-Scale Machine Learning

Expectations The focus will be on understanding Big data algorithms and ensuring scalability for large scale problems While we will use Spark/Python, the skills acquired should be transferable to other platforms/frameworks While I will show you code examples, this is not a course to take if you want to learn to program in Java/Python/Scala/. Programming assignments will be frequent and hard. Additionally, the cluster is unreliable, so do not wait until the last evening to test your code. Choose the default project if you

Where to look for help: course website / Canvas; email; TA office hours (MEB 3XXX); my office hours (MEB 3454, Tue, Thu 2-3pm). Questions?

What is Big Data?

What is Big Data? Volume, Variety, Veracity, Velocity

The recognition that data is at the center of our digital world and that there are big challenges in collecting, storing, processing, analyzing, and making use of such data. What counts as "Big" depends on the application domain.

Kinds of Data. Web data & web data accesses; emails, chats, tweets; telephone data; public databases (gene banks, census records, ...); private databases (medical records, credit card transactions, ...); sensor data (camera surveillance, wearable sensors, seismograms); byproducts of computer systems operations (power signal, CPU events, ...).

Data is valuable. Google & Facebook make money by mining user data for advertising revenue. Financial firms analyze financial records, real-time transactions, and current events for profitable trades. Medical records can be processed for better and potentially cheaper health care. Roadways are monitored for traffic analysis and control. Face detection in airports.

Collection of Big Data. The amount of data available is increasing exponentially, but it is still challenging to collect: difficult to get access, redundancy, noise.

Processing & Analysis of Big Data. Large datasets do not fit in memory; processing large datasets is time consuming; parallel processing is necessary but challenging. Message Passing Interface (MPI) (1991): low(er)-level C/Fortran API for communication; powerful, hard(er) to code, not fault tolerant. MapReduce (Google, 2004): originated from web data processing; ease of programming, fault tolerant, limited semantics. Spark, GraphLab, Storm, ...
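To make the "ease of programming, limited semantics" point concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps for word counting. It is plain Python, not the Hadoop or Spark API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are illustrative assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    # Each map_phase call is independent, so a real system could run them on different nodes.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, for illustration only.
counts = word_count(["big data systems", "big clusters process big data"])
```

The programmer writes only `map_phase` and `reduce_phase`; everything else is the framework's job, which is where both the convenience and the restricted semantics come from.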

Storage & I/O. Storage and I/O are critical for big data performance and reliability. Hardware: disks, Flash, SSD, nonvolatile memory, 3D memory. Parallelism: RAID, parallel data storage, DFS. Data durability and consistency.

Data privacy & protection. Misuse of big data is a big concern. A person's online activities can reveal all aspects of the person's life. Systems need to provide clear guidelines on data privacy and protection. Sensitive clinical information. Understand how the big data world operates: as a user, as a developer.

Parallel Thinking: THE MOST IMPORTANT GOAL OF TODAY'S LECTURE

Parallelism & beyond. 1 ox: single-core performance; 1024 chickens: parallelism; tractor: better algorithms. "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" (Seymour Cray). Credit: Phillip Stanley-Marbell

Consider an array $A$ with $n$ elements. Goal: to compute $x = \sum_{i=0}^{n-1} A_i$. Machine model; programming model; performance analysis.

Von Neumann architecture. Central Processing Unit (CPU, core); memory; input/output (I/O). One instruction per unit time; sequential. [Figure: control unit, arithmetic logic unit, accumulator, memory, input, output]

Characterizing algorithm performance: O-notation. Given an input of size $n$, let $T(n)$ be the total time and $S(n)$ the necessary storage. Given a problem, is there a way to compute lower bounds on storage and time? Algorithmic complexity: $T(n) = O(f(n))$ means $T(n) \le c\,f(n)$, where $c$ is some unknown positive constant; compare algorithms by comparing $f(n)$.

Scalability. Scale vertically (scale-up): add resources to a single node (CPU, memory, disks, ...). Scale horizontally (scale-out): add more nodes to the system.

Parallel Performance. Speedup: best sequential time / time on $p$ processors. Efficiency: speedup / $p$ (< 1). Scalability.
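The two ratios above can be computed directly from measured run times. The sketch below assumes wall-clock times in seconds; the function names and the timings are made up for illustration and are not measurements from the course cluster.

```python
def speedup(t_seq, t_par):
    """Best sequential time divided by the time on p processors."""
    return t_seq / t_par


def efficiency(t_seq, t_par, p):
    """Speedup divided by the number of processors used."""
    return speedup(t_seq, t_par) / p


# Hypothetical timings: best sequential run and runs on p = 2, 4, 8 processors.
t_seq = 120.0
timings = {2: 65.0, 4: 36.0, 8: 24.0}

for p, t_par in sorted(timings.items()):
    print(f"p={p}: speedup={speedup(t_seq, t_par):.2f}, "
          f"efficiency={efficiency(t_seq, t_par, p):.2f}")
```

Note that the numerator is the best sequential time, not the parallel code run on one processor; using the latter tends to make a parallel implementation look better than it is.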

Amdahl's Law. Sequential bottlenecks: let $s$ be the fraction of the overall work that is sequential. Then the speedup on $p$ processors is given by $S = \frac{1}{s + \frac{1-s}{p}} \le \frac{1}{s}$.
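A small numerical sketch of Amdahl's law shows how quickly the sequential fraction caps the speedup. It assumes $s$ is given as a fraction between 0 and 1; the function name `amdahl_speedup` and the sample values are illustrative.

```python
def amdahl_speedup(s, p):
    """Speedup on p processors when a fraction s of the work is sequential."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (s + (1.0 - s) / p)


# With 10% sequential work the speedup stays below 1/s = 10, however many processors are added.
for p in (1, 10, 100, 1000):
    print(f"p={p}: speedup={amdahl_speedup(0.1, p):.2f}")
```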

Gustafson's Law. The sequential part should be independent of the problem size; increase the problem size with an increasing number of processors.
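The slide gives no formula here. A common way to state Gustafson's observation is the scaled speedup $S(p) = s + (1 - s)\,p$, where $s$ is the sequential fraction of the run time measured on the $p$-processor machine. The sketch below assumes that formula; the function name is illustrative.

```python
def gustafson_speedup(s, p):
    """Scaled speedup when the problem grows with p and a fraction s of the run time stays sequential."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return s + (1.0 - s) * p


# Unlike the fixed-size case, the scaled speedup keeps growing with p.
for p in (1, 10, 100, 1000):
    print(f"p={p}: scaled speedup={gustafson_speedup(0.1, p):.1f}")
```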

Strong & Weak Scalability (increasing number of cores). Strong (fixed-size) scalability: keep the problem size fixed. Weak (scaled-size) scalability: keep the problem size per core fixed.
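In practice the two kinds of study differ only in how the problem size is chosen for each core count. The sketch below generates the (cores, problem size) pairs for each kind of experiment; the function names and sizes are hypothetical. Weak-scaling efficiency is taken here as the one-core time divided by the $p$-core time, which is a common convention rather than something stated on the slide.

```python
def strong_scaling_runs(n_total, core_counts):
    """Fixed total problem size; only the number of cores changes."""
    return [(p, n_total) for p in core_counts]


def weak_scaling_runs(n_per_core, core_counts):
    """Fixed problem size per core; the total size grows with the number of cores."""
    return [(p, n_per_core * p) for p in core_counts]


def weak_efficiency(t_one_core, t_p_cores):
    """Ideal weak scaling keeps the run time constant, giving an efficiency of 1."""
    return t_one_core / t_p_cores


cores = [1, 2, 4, 8]
print(strong_scaling_runs(1_000_000, cores))
print(weak_scaling_runs(1_000_000, cores))
```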

Parallel Programming. Partition the work (data & tasks); determine communication; agglomerate to the number of available processors; map to processors; tune for the architecture.

[Figure: problem, partition, communicate, agglomerate, map]

Consider an array $A$ with $n$ elements. Goal: to compute $x = \sum_{i=0}^{n-1} A_i$.

Work/Depth Models. Abstract programming model that exposes the parallelism. Compute work $W$ and depth $D$, where $D$ is the longest chain of dependencies; the parallelism is $P = W/D$. Directed acyclic graphs. Concepts: parallel for (data decomposition); recursion (divide and conquer). [Figure: DAGs of + nodes summing $A_1, \dots, A_8$]

Sequential vs Parallel for. Dependent statements: $W = \sum_i W_i$, $D = \sum_i D_i$. Independent statements: $W = \sum_i W_i$, $D = \max_i D_i$.
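These composition rules are easy to encode. The sketch below is one possible way to do it, with a hypothetical `Cost` record and two combinators: work always adds, while depth adds for dependent statements and takes the maximum for independent ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    work: int   # total number of operations
    depth: int  # longest chain of dependent operations


def sequential(*costs):
    """Dependent statements: both work and depth add up."""
    return Cost(sum(c.work for c in costs), sum(c.depth for c in costs))


def parallel(*costs):
    """Independent statements: work adds up, depth is the maximum (needs at least one cost)."""
    return Cost(sum(c.work for c in costs), max(c.depth for c in costs))


# Eight independent unit-cost additions: composed either way.
unit = Cost(work=1, depth=1)
print(sequential(*[unit] * 8))  # work 8, depth 8
print(parallel(*[unit] * 8))    # work 8, depth 1
```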

Parallel Sum (language model)
// Recursive implementation
Algorithm SUM(a, n)
// Input: array a of length n = 2^k, k = log n
  if n = 1 return a(0)
  parallel_for i ∈ [0, n/2)
    b(i) ← a(2i) + a(2i+1)
  return SUM(b, n/2)   // W(n/2), D(n/2)
Complexity:
  $D(n) = D(n/2) + O(1) = O(\log n)$
  $W(n) = W(n/2) + O(n) = O(n)$
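The recursion can be traced with a short sequential Python sketch that also counts work and depth. It does not run anything in parallel: the list comprehension stands in for the parallel_for, and each level is counted as one step of depth on the assumption that its additions are independent. The function name `tree_sum` is illustrative.

```python
def tree_sum(a):
    """Return (total, work, depth) for the pairwise recursive sum of a, with len(a) a power of two."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return a[0], 0, 0
    # The n/2 additions below are independent, so a parallel_for could run them in one step.
    b = [a[2 * i] + a[2 * i + 1] for i in range(n // 2)]
    total, work, depth = tree_sum(b)
    return total, work + n // 2, depth + 1


total, work, depth = tree_sum(list(range(1, 9)))
print(total, work, depth)
```

For $n = 8$ the work is $4 + 2 + 1 = n - 1$ additions and the depth is $\log_2 n = 3$, matching $W(n) = O(n)$ and $D(n) = O(\log n)$.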

Questions?