15-319 / 15-619 Cloud Computing. Recitation 11, November 10th and November 12th, 2015




Overview
- Administrative issues: tagging, 15619Project, project code
- Last week's reflection: Project 3.4, Quiz 9
- This week's schedule: Project 3.5, Unit 5 - Module 18, 15619Project Phase 2, Quiz 10
- Demo
- Twitter Analytics: The 15619Project

Reminders
- Monitor AWS expenses regularly and tag all resources.
  - Check your bill (Cost Explorer > filter by tags).
- Piazza guidelines
  - Please tag your questions appropriately.
  - Search for an existing answer first.
- Provide clean, modular and well-documented code.
  - Large penalties for not doing so.
  - Double check that your code is submitted!! (Verify by downloading it from TPZ from the submissions page.)
- Utilize office hours.
  - We are here to help (but not to give solutions).
- Use the team AWS account and tag the 15619Project resources carefully.

Project 3.4: FAQs
- Problem 1: DynamoDB import failure
  - Make sure the data is processed correctly to meet DynamoDB's requirements. Be aware of escaping problems.
- Problem 2: Maven dependency error
  - Be aware of what you include. Some dependencies might conflict with each other.

This week: Project 3.5
- P3.1: Files, SQL and NoSQL
- P3.2: Sharding and Replication
- P3.3: Consistency
- P3.4: Social network with heterogeneous backends
- P3.5: Data warehousing and OLAP

P3.5: Background
- Carnegie Eagle (CE), a supermarket chain, wishes to expand its business into other markets and wants a decision support system.
- The CTO of CE wants you to analyze various data warehouses and pick the best one for the job.
- The CTO subscribes to the "Hottest Data Warehouses" weekly and decides she wants you to analyze the following:
  - Hive
  - Impala
  - Redshift

OLTP vs OLAP

Data warehousing and OLAP
- OLAP (Online Analytical Processing) queries deal with historical/archived data.
- OLAP warehouses are optimized for reads and aggregations and rarely perform updates.
- Tables in OLAP are denormalized, whereas tables in OLTP (Online Transaction Processing) are normalized.
- Data warehouses are tuned for high throughput since they process large amounts of data.
- OLTP databases are tuned for smaller updates and lower latency.
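The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `sales` table, its columns, and the rows are hypothetical and are not part of the P3.5 dataset, and SQLite stands in for a real warehouse.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, store TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (store, year, amount) VALUES (?, ?, ?)",
    [
        ("Pittsburgh", 2014, 10.0),
        ("Pittsburgh", 2015, 20.0),
        ("Oakland", 2014, 5.0),
        ("Oakland", 2015, 7.5),
    ],
)

# OLTP-style work: a small update that touches a single row by its key.
conn.execute("UPDATE sales SET amount = 12.0 WHERE order_id = 1")

# OLAP-style work: a read-only aggregation that scans the whole table.
totals = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```

The update benefits from low latency on a single key, while the aggregation benefits from scan throughput, which is the trade-off the slide describes.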

OLAP data warehouses
- Hive
  - Built on Hadoop; provides a SQL-like interface to query the data.
  - Translates user SQL queries into MapReduce jobs.
- Impala
  - Also based on Hadoop; provides a SQL-like interface similar to Hive.
  - Uses its own engine to translate user queries and directly access data on the cluster (hence lower latency).
- Redshift
  - Data-warehouse-as-a-service provided by Amazon Web Services.
  - Used for real-time analytics.

Modules to Read
UNIT 5: Distributed Programming and Analytics Engines for the Cloud
- Module 18: Intro to distributed programming for the Cloud
- Module 19: Distributed analytics engines: MapReduce
- Module 20: Distributed analytics engines: Spark
- Module 21: Distributed analytics engines: GraphLab

Distributed Programming
- Taxonomy of programs:
  - Sequential
  - Concurrent
  - Parallel
- Challenges in programming the cloud:
  - Scalability
  - Communication overhead
  - Heterogeneity
  - Synchronization
  - Fault tolerance
  - Scheduling

Upcoming Deadlines
- Quiz 10: Unit 5 - Module 18
  - Open: 11/13/2015 12:01 AM Pittsburgh
  - Due: 11/13/2015 11:59 PM Pittsburgh
- Project 3.5: Data warehousing and OLAP
  - Due: 11/15/2015 11:59 PM Pittsburgh
- 15619Project: Phase 2
  - Live-test due: 11/11/2015 4:59 PM Pittsburgh
  - Code and report due: 11/12/2015 11:59 PM Pittsburgh

Busy Weeks Coming Up!
- Wednesday 11/11/2015 18:00:01 EST: Phase 2 Live Test
- Thursday 11/12/2015 23:59:59 EST: Phase 2 Code & Report Due
- Friday 11/13/2015 23:59:59 EST: Quiz 10
- Sunday 11/15/2015 23:59:59 EST: P3.5 Due
- Wednesday 12/2/2015 20:00:01 EST: Phase 3 Live Test
- Thursday 12/3/2015 23:59:59 EST: Phase 3 Code & Report Due
- Friday 12/4/2015 23:59:59 EST: Quiz 12
- Sunday 12/6/2015 23:59:59 EST: P4.2 Due

Project 3.5 OLAP with Cloud Data Warehousing

Star Schema

Data Warehousing Benchmark

Data Warehousing Benchmark
- Hive
  - No optimization required.
  - Follow the instructions and you are done.
- Impala and Redshift
  - Load data, execute unoptimized queries.
  - Optimize table schemas and/or queries.
  - The evaluation will be on both correctness and response time.

Notes
- Hive: May take a few hours, be patient.
- Impala: Some unoptimized queries may throw exceptions.
- Redshift: Be aware of the high expenditure! Think before you start.

Project 3.5 Demo

twitter DATA ANALYTICS: 15619 PROJECT

15619 Project Phase 2 Deadlines
[Timeline figure. Stages shown: 15619 Project Phase 1; 15619 Project Phase 2 Q3 & Q4 Development; 15619 Project Phase 2 Live Test; 15619 Project Phase 2 Code & Report Due. Timestamps shown, in order: Thursday 10/15/2015 00:00:01 ET; Thursday 10/29/2015 23:59:59 ET; Wednesday 11/11/2015 16:00:01 ET; Wednesday 11/11/2015 23:59:59 ET; Thursday 11/12/2015 23:59:59 ET.]

Phase 2
- Two more queries (Q3 and Q4)
  - More ETL
  - Multiple tables and queries
- Live Test!!!
  - Warmup, Q1, Q2, Q3, Q4, Mixed Q1-Q4
    - Each for 30 min
  - Both HBase and MySQL
    - Two DNS
  - Must get over 50% score on both!
- No more pre-caching of known requests

Query 3: Spreading Happiness & Sadness
- Q. How to compute the impact score for each tweet?
  - score = sentiment_score * (1 + followers_count)
- Q. What's followers_count and how do I find it?
  - Read https://dev.twitter.com/docs/platform-objects/tweets
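The impact score formula can be written as a one-line function. This is a minimal sketch: the function name is hypothetical, and it assumes both inputs are already available as integers after ETL.

```python
def impact_score(sentiment_score: int, followers_count: int) -> int:
    """Impact score from the slide: sentiment_score * (1 + followers_count)."""
    return sentiment_score * (1 + followers_count)


# A tweet with sentiment 2 from a user with 9 followers scores 2 * 10.
print(impact_score(2, 9))
```

Because of the `1 +` term, a tweet from a user with zero followers keeps its own sentiment score instead of collapsing to zero.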

Query 3: Spreading Happiness & Sadness
Sample Query:
GET /q3?start_date=2012-10-17&end_date=2014-11-30&userid=2334640502&n=10
n will be less than or equal to 10.
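One way a front end might parse the sample Q3 request, using only the Python standard library. This is a sketch under assumptions: the helper name `parse_q3` and the returned dictionary layout are hypothetical, and the only validation shown is the `n <= 10` bound stated on the slide.

```python
from datetime import date
from urllib.parse import parse_qs, urlsplit


def parse_q3(url: str) -> dict:
    """Extract the four Q3 parameters from a request path such as /q3?..."""
    params = parse_qs(urlsplit(url).query)
    n = int(params["n"][0])
    if n > 10:
        raise ValueError("n must be less than or equal to 10")
    return {
        "start_date": date.fromisoformat(params["start_date"][0]),
        "end_date": date.fromisoformat(params["end_date"][0]),
        # Kept as a string so the ID can be used directly as a lookup key.
        "userid": params["userid"][0],
        "n": n,
    }


request = parse_q3(
    "/q3?start_date=2012-10-17&end_date=2014-11-30&userid=2334640502&n=10"
)
print(request)
```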

Query 3: Spreading Happiness & Sadness
Sample Response:
Positive Tweets
2014-04-18,536,456980867093524480,@AustinMahone Austin please follow me i love you much! x44
2014-04-07,512,453011017078145024,@ArianaGrande Ariana please follow me i love you much! x72
2014-04-07,512,453011398722064384,@ArianaGrande Ariana please follow me i love you much! x86
2014-04-07,512,453012510233624576,@ArianaGrande Ariana please follow me i love you much! x114
2014-04-07,512,453020030607695873,@ArianaGrande Ariana please follow me i love you much! x257
2014-04-07,508,453007393199513600,@ArianaGrande Ariana please follow me i love you much! x8
2014-04-18,270,456984730043711489,@AustinMahone please follow me. I know that it is "impossible" that you see this tweet but if you see it, please!! x45
2014-04-22,270,458751123390615552,@ArianaGrande please follow me. I know that it is "impossible" that you see this tweet but if you see it, please!! x7
Negative Tweets

Query 4: Hashtag Analysis
Use the hashtag entity.
Sample Query:
GET /q4?hashtag=01net&n=4
Sample Response:
2014-05-16:2:20598769,617996172:#01net: Philips attaque Nintendo et veut faire bannir la Wii U des ventes aux Etats-Unis http://t.co/v55bilpfoj
2014-03-28:1:20598769:#01net: Apple pourrait lancer un MacBook Air Retina et un ipad 12'' au second semestre http://t.co/7iry9qan2u
2014-03-29:1:20598769:#01net: TV 4K : faut-il craquer tout de suite? Nos tests, les prix et ce qui arrive... http://t.co/jrpzoajtfy
2014-04-08:1:20598769:#01net: Le rapprochement entre Bouygues Telecom et Free se prépare en douceur http://t.co/d3vipniyvv

Phase 2 Clarifications
- Censorship? q3:, q4: X
- No required time filtering
- Tie breaker?
  - q3:
    - Primary key: absolute value of score (descending)
    - Secondary key: tweet id; resolve ties by tweet id (ascending)
  - q4:
    - Primary key: count; the date with the higher count comes first
    - Secondary key: date; the earliest date comes first
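The tie-breaking rules map directly onto composite sort keys. The sketch below assumes rows have already been fetched as tuples; the function names and tuple layouts are hypothetical, and the sample rows are taken from the sample responses for Q3 and Q4.

```python
def rank_q3(tweets, n):
    """tweets: (date, score, tweet_id, text) tuples.

    Primary key: absolute value of score, descending.
    Secondary key: tweet id, ascending.
    """
    return sorted(tweets, key=lambda t: (-abs(t[1]), t[2]))[:n]


def rank_q4(rows, n):
    """rows: (date, count) tuples with ISO-formatted date strings.

    Primary key: count, descending.
    Secondary key: date, earliest first (ISO dates sort correctly as text).
    """
    return sorted(rows, key=lambda r: (-r[1], r[0]))[:n]


q3_rows = [
    ("2014-04-07", 512, 453011398722064384, "x86"),
    ("2014-04-18", 536, 456980867093524480, "x44"),
    ("2014-04-07", 512, 453011017078145024, "x72"),
]
q4_rows = [("2014-03-29", 1), ("2014-05-16", 2), ("2014-03-28", 1)]

q3_ids = [t[2] for t in rank_q3(q3_rows, 10)]
q4_dates = [r[0] for r in rank_q4(q4_rows, 4)]
print(q3_ids)
print(q4_dates)
```

Negating the primary key lets a single ascending sort produce descending order on the first field and ascending order on the second.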

Tips for Phase 2
- Carefully design and check the ETL process.
- Watch your budget. $60 = phase + Live Test.
- Preparing for the live test:
  - You are required to submit two URLs, one for the MySQL live test and one for the HBase live test.
  - Budget is limited to $1.25/hr for the MySQL and HBase web services separately. Tag correctly. Do NOT run anything else.
  - Caching known requests will not work.
  - You need to have all of Q1-Q4 running at the same time.
  - Don't expect testing in sequence.
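A quick way to check a planned deployment against the $1.25/hr limit before launching it. This is a sketch only: the per-instance prices below are made-up placeholders, not real AWS prices, and the function names are hypothetical.

```python
from decimal import Decimal

HOURLY_LIMIT = Decimal("1.25")  # per web service, from the slide


def hourly_cost(fleet):
    """fleet: list of (price_per_hour_as_string, instance_count) pairs."""
    return sum((Decimal(price) * count for price, count in fleet), Decimal("0"))


def within_budget(fleet, limit=HOURLY_LIMIT):
    return hourly_cost(fleet) <= limit


# Placeholder prices, for illustration only.
plan_a = [("0.25", 4), ("0.10", 2)]
plan_b = [("0.25", 5), ("0.10", 1)]
print(hourly_cost(plan_a), within_budget(plan_a))
print(hourly_cost(plan_b), within_budget(plan_b))
```

Decimal arithmetic is used so that sums of cent amounts compare exactly against the limit.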

MySQL
- Understand the difference between different storage engines.
- Understand the difference between different types of indices.
- Remember, replication may not be the best strategy if the tables are big.
- Avoid costly operations in the front end.
- Make use of connection pools and caching configuration in MySQL.
- Consider using a MySQL Cluster.
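The connection pool idea can be sketched in a few lines: connections are opened once and reused, so each request avoids the cost of a new connection. This sketch is an assumption-laden illustration, not a production pool: the `ConnectionPool` class is hypothetical, and sqlite3 stands in for a MySQL driver so the example runs without a server.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are created up front and reused."""

    def __init__(self, size, connect):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free, then returns it to the pool
        # when the caller's with-block exits, even on error.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


# sqlite3 is a stand-in here; a MySQL driver's connect call would go in its place.
pool = ConnectionPool(
    2, lambda: sqlite3.connect(":memory:", check_same_thread=False)
)
with pool.connection() as conn:
    answer = conn.execute("SELECT 1 + 1").fetchone()[0]
print(answer)
```

In practice most web frameworks and database drivers ship a pool of their own, which is usually preferable to a hand-written one.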

HBase
- Make use of the Web UI: check which tables are getting the most requests.
- Use the split, move and merge functions to distribute load evenly.
- Make sure your front end is not the bottleneck.
- Carefully design the schema.
- Consider building your own HBase cluster; EMR is expensive and abstracts away details.

Phase 2 Deadline
- What's due soon?
  - Submission by 16:59 ET (Pittsburgh) Wed 11/11
  - HBase Live from 6 PM to 9 PM ET
  - MySQL Live from 9 PM to midnight ET
  - Fix Q1 and Q2 if your Phase 1 did not go well
  - New queries Q3 and Q4. Targets 5000 and 8000 rps
  - Get a score higher than 50% on both MySQL and HBase
- Heads up: Phase 2 counts for 30% of the 15619Project grade
- Report at the end of Phase 2
  - Make sure you highlight failures and learning
  - If you didn't do well, explain why
  - If you did, explain how
  - Cannot begin to stress how critical this is!!!!