CS 378 Big Data Programming. Lecture 1 Introduc:on



Similar documents
Machine- Learning Summer School

CS 378 Big Data Programming. Lecture 2 Map- Reduce

CS 378 Big Data Programming

CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns

City University of Hong Kong. Course Syllabus. offered by Department of Computer Science with effect from Semester A 2015/16

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Big Data Analytics Hadoop and Spark

Improving MapReduce Performance in Heterogeneous Environments

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Unified Big Data Processing with Apache Spark. Matei

Map Reduce & Hadoop Recommended Text:

GraySort on Apache Spark by Databricks

Big Data Processing. Patrick Wendell Databricks

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

6.S897 Large-Scale Systems

Introduction to Spark

Hadoop Parallel Data Processing

Using RDBMS, NoSQL or Hadoop?

Open source Google-style large scale data analysis with Hadoop

Can t We All Just Get Along? Spark and Resource Management on Hadoop

Learning. Spark LIGHTNING-FAST DATA ANALYTICS. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

Implement Hadoop jobs to extract business value from large and varied data sets

TOP 8 TRENDS FOR 2016 BIG DATA

Chapter 7. Using Hadoop Cluster and MapReduce

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce

Introduction to Cloud Computing

Learn how to store and analyze Big Data Learn about the cloud and its services for Big Data

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

Cloud Computing. Summary

Big Data for everyone Democratizing big data with the cloud. Steffen Krause Technical

MapReduce, Hadoop and Amazon AWS

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

10605 BigML Assignment 4(a): Naive Bayes using Hadoop Streaming

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

The Easiest Way to Run Spark Jobs. How-To Guide

Introduc8on to Apache Spark

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Brave New World: Hadoop vs. Spark

Architectures for massive data management

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

HADOOP MOCK TEST HADOOP MOCK TEST I

Large scale processing using Hadoop. Ján Vaňo

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Big Data on Microsoft Platform

HiBench Introduction. Carson Wang Software & Services Group

Hadoop & Spark Using Amazon EMR

Big Data Frameworks Course. Prof. Sasu Tarkoma

How To Create A Data Visualization With Apache Spark And Zeppelin

/ Cloud Computing. Recitation 11 November 10 th and November 12 th, 2015

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Real Time Big Data Processing

Storage Architectures for Big Data in the Cloud

BIG DATA USING HADOOP

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

Introduction to Database Systems CS4320/CS5320. CS4320/4321: Introduction to Database Systems. CS4320/4321: Introduction to Database Systems

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Step by Step: Big Data Technology. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 25 August 2015

MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration

Big Data Frameworks: Scala and Spark Tutorial

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Large-Scale Data Processing

Hadoop implementation of MapReduce computational model. Ján Vaňo

Customer Case Study. Sharethrough

HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY

A Tutorial Introduc/on to Big Data. Hands On Data Analy/cs over EMR. Robert Grossman University of Chicago Open Data Group

Distributed Computing and Big Data: Hadoop and MapReduce

Hadoop Cluster Applications

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Unified Big Data Analytics Pipeline. 连 城

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

ECBDL 14: Evolu/onary Computa/on for Big Data and Big Learning Workshop July 13 th, 2014 Big Data Compe//on

Professional Hadoop Solutions

Hands-on Exercises with Big Data

CSE-E5430 Scalable Cloud Computing. Lecture 4

Workshop on Hadoop with Big Data

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

BIG DATA HADOOP TRAINING

Open source large scale distributed data management with Google s MapReduce and Bigtable

Hadoop & its Usage at Facebook

Introduction to Hadoop

CS 40 Computing for the Web

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Transcription:

CS 378 Big Data Programming Lecture 1 Introduc:on

Class Logis:cs Class meets MW, 9:30 AM 11:00 AM Office Hours GDC 4.706 MW 11:00 12:00 AM By appointment Email: dfranke@cs.utexas.edu Web page: cs.utexas.edu/~dfranke/courses/2015spring/cs378- BDP.htm TA: Swadhin Pradhan Office hours:

Course Content Programming in Hadoop (map- reduce) and Spark Use Elas:cMapReuce (EMR) on Amazon Web Services (AWS) ini:ally Hope to use the Hadoop cluster at TACC (if available) Local install of Hadoop Looking into cloud based Spark cluster From DataBricks TACC is another possibility Local install can also be used

Textbooks MapReduce Design Paberns Main content for Hadoop assignments Hadoop The Defini:ve Guide 3 RD Edi:on Recommended for your understanding, not required Learning Spark (early release) Main content for Spark assignments

Lectures PDF of lecture notes accessible via syllabus For your note taking, review, or whatever These notes are my outline for each class MLSS 2015 Big Data Programming 5

Assignments Assignments will be programming assignments All work can be done using Java Scala might be an op:on IDE for developing code recommended Eclipse, IntelliJ IDE (community edi:on) are free Use maven to build uber JAR to upload to the cloud I ll provide the pom.xml file used by maven

Assignments I ll review a solu:on in class on the due date Work submibed aier the start of class considered late 25% penalty for late submission Can be submibed un:l the next assignment is due Aier that deadline, no credit is given Will consider these in determining final grade I encourage you to keep pace with the assignments Most assignments will build on previous work

Ques:ons? MLSS 2015 Big Data Programming 8

Learning from Data What can we do when the data gets big? Too big for the CPU memory of any single machine Larger than the disk storage of a single machine Recent data point: Facebook has ~800 petabyte data cluster (Hadoop) 1 petabyte = 10 15 bytes Big data is spread across a network of machines MLSS 2015 Big Data Programming 9

Learning from Big Data Need to bring distributed storage and distributed processing to bear to handle big data Issues: Distribu:ng computa:on across many machines Maximizing performance Minimize I/O to disk, minimize transfers across the network Combining the results of distributed computa:on Recovering from failures MLSS 2015 Big Data Programming 10

Managing Big Data We ll look at two popular tools/systems One well established Hadoop One up and coming Spark Basic concepts of each How they address the aforemen:oned issues How to solve various problems with these systems MLSS 2015 Big Data Programming 11

Managing Big Data When wri:ng a program with these tools You don t know the size of the data You don t know the extent of the parallelism Both try to collocate the computa:on with the data Parallelize the I/O Make the I/O local (versus across network) Data is oien unstructured (vs. rela:onal model) MLSS 2015 Big Data Programming 12

Big Data vs. Rela:onal RDBMS normaliza:on Goal is to remove redundancy and retain/insure integrity Big data apps want reads to be local Send the code to the data, as it much smaller (Jim Gray) Normaliza:on makes read non- local Processing examines one input record at a :me Minimal state in programs it s in the data MLSS 2015 Big Data Programming 13

Big Data Tools This all sounds great. What are the issues? Coordina:ng the distributed computa:on Handling par:al failures Combining the results of distributed computa:on Tools offer a programming model that abstracts Disk read and write Paralleliza:on (computa:on and I/O) Combining data (keys and values) MLSS 2015 Big Data Programming 14

MapReduce Design Paberns Summariza:on Filtering Data Organiza:on Par::oning/binning, sor:ng, shuffle Joins Merging data sets Meta- paberns Op:mizing map- reduce chains (data pipelines) MLSS 2015 Big Data Programming 15

Resources for Hadoop Hadoop: The Defini/ve Guide, 3rd Edi/on, by Tom White O Reilly Media Print ISBN: 978-1- 4493-1152- 0 ISBN 10: 1-4493- 1152-0 Ebook ISBN: 978-1- 4493-1151- 3 ISBN 10: 1-4493- 1151-2 MapReduce Design Pa=erns, by Donald Miner and Adam Shook O Reilly Media Print ISBN: 978-1- 4493-2717- 0 ISBN 10: 1-4493- 2717-6 Ebook ISBN: 978-1- 4493-4197- 8 ISBN 10: 1-4493- 4197-7 hbp://hadoop.apache.org/ Several vendors provide Hadoop distribu:ons Amazon Web Services Elas:cMapReduce MLSS 2015 Big Data Programming 16

Resources for Spark Learning Spark, (early release) by Holden Karau, Andy Konwinsky, Patrick Wendell, Matei Zaharia O Reilly Media Print ISBN: 978-1- 4493-5862- 4 ISBN 10: 1-4493- 5862-4 Ebook ISBN: 978-1- 4493-5860- 0 ISBN 10: 1-4493- 5860-8 hbp://spark.apache.org/ Can download a version that runs on your local machine Cloud services Spark on AWS DataBricks offers a cloud service Others will join the party MLSS 2015 Big Data Programming 17