6.S897 Large-Scale Systems

Similar documents
Unified Big Data Processing with Apache Spark. Matei

Open source Google-style large scale data analysis with Hadoop

Moving From Hadoop to Spark

Dominik Wagenknecht Accenture

Open Source for Cloud Infrastructure

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Open source large scale distributed data management with Google s MapReduce and Bigtable

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Large-Scale Data Processing

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Cloud Computing Training

Search Engine Architecture

Dell In-Memory Appliance for Cloudera Enterprise

How To Create A Data Visualization With Apache Spark And Zeppelin

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

HDP Enabling the Modern Data Architecture

How Companies are! Using Spark

HDP Hadoop From concept to deployment.

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Hadoop Ecosystem B Y R A H I M A.

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Hadoop IST 734 SS CHUNG

So What s the Big Deal?

Big Data and Industrial Internet

Ali Ghodsi Head of PM and Engineering Databricks

Challenges for Data Driven Systems

Enabling High performance Big Data platform with RDMA

Datacenters and Cloud Computing. Jia Rao Assistant Professor in CS

Introduction to Big Data Training

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

VP/GM, Data Center Processing Group. Copyright 2014 Cavium Inc.

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop & Spark Using Amazon EMR

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

The Flash Transformed Data Center & the Unlimited Future of Flash John Scaramuzzo Sr. Vice President & General Manager, Enterprise Storage Solutions

Evolution from Big Data to Smart Data

HYBRID CLOUD SUPPORT FOR LARGE SCALE ANALYTICS AND WEB PROCESSING. Navraj Chohan, Anand Gupta, Chris Bunch, Kowshik Prakasam, and Chandra Krintz

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Web Technologies: RAMCloud and Fiz. John Ousterhout Stanford University

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

The Rise of Cloud Computing Systems

Hadoop: Embracing future hardware

Customer Case Study. Sharethrough

Spark: Making Big Data Interactive & Real-Time

Data Centers and Cloud Computing

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Cloud Big Data Architectures

Scalable Architecture on Amazon AWS Cloud

Workshop on Hadoop with Big Data

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

White Paper: What You Need To Know About Hadoop

Spark: Cluster Computing with Working Sets

ebook Big Data Explained, Analysed, Solved

Real Time Big Data Processing

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

The Greenplum Analytics Workbench

Comprehensive Analytics on the Hortonworks Data Platform

Peers Techno log ies Pv t. L td. HADOOP

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

Marko Grobelnik Jozef Stefan Institute Ljubljana, Slovenia

Distributed Data Parallel Computing: The Sector Perspective on Big Data

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

HYPER-CONVERGED INFRASTRUCTURE STRATEGIES

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, UC Berkeley, Nov 2012

APP DEVELOPMENT ON THE CLOUD MADE EASY WITH PAAS

Apache Flink Next-gen data analysis. Kostas

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

The Cloud to the rescue!

Disrupting The Market: Predictive Analytics As A Service

Virtualizing Apache Hadoop. June, 2012

Big Data Infrastructure at Spotify

Managing large clusters resources

Case Study : 3 different hadoop cluster deployments

June Production Hadoop systems in the enterprise

Big Data Explained. An introduction to Big Data Science.

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Performance and Scalability Overview

ebay Storage, From Good to Great

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Session 0202: Big Data in action with SAP HANA and Hadoop Platforms Prasad Illapani Product Management & Strategy (SAP HANA & Big Data) SAP Labs LLC,

Transcription:

6.S897 Large-Scale Systems Instructor: Matei Zaharia" Fall 2015, TR 2:30-4, 34-301 bit.ly/6-s897

Outline What this course is about" " Logistics" " Datacenter environment

What this Course is About Large-scale computer systems Web applications: Facebook, Gmail, etc Big data: one computation on many machines Clouds: software or infrastructure as a service Systems that run on hundreds of nodes

Why Study Large-Scale Systems? Increasingly run most computer applications One of the more likely areas for impact Rich systems & algorithmic problems

Trends Behind Large-Scale Systems 1. Growth of workloads (users, data) relative to machine speeds 2. Faster Internet 3. Economics Benefits of software and infra as a service Economies of scale for providers

1. Growth of Workloads Growing # of Internet users Growth of data w.r.t. computation speeds Mostly from machines: sensors, images, IoT, etc

2. Faster Internet Speed of a high-end residential connection

3. Economics Benefits of software as a service: For vendors: single deployment target, visibility For users: easier to manage Benefits of infrastructure as a service: Elastic scaling, pay-as-you-go Economies of scale (lower costs in bulk)

This Course Papers & readings on influential systems File systems, databases, coordination, processing frameworks, resource managers, networks, performance, programming tools Guest talks from 4 speakers Focus on what s widely deployed

Outline What this course is about" " Logistics" " Datacenter environment

Readings 2-3 per class Each has summary questions Email answers to 6.s897staff@gmail.com Each student will present 1 paper in class 15 minute talk; see website

Projects Ideally in groups of 2-3 Term-long mini research project of your choice Can be related to your research Matei will list some ideas Report + poster session at end of term

Project Timeline Oct 2 Oct 9 Nov 10 Dec 10 Dec 15 Form groups, run idea by Matei Initial proposal (1-2 pages) Mid-term review Poster session Final writeup (10-12 pages)

Grading 70% project 15% paper presentation 15% summary questions + participation

Course Staff Instructor: Matei Zaharia Office hours TBD, likely Tuesday at 4 TA: Rohan Mahajan Will help with summary questions & logistics

Other Notes The course is 12 units! I have some EC2 credits for the projects

Outline What this course is about" " Logistics" " Datacenter environment

Typical Datacenter

Typical Datacenter 10K-100K servers 40 servers per rack 10 Gbps bandwidth in rack;" 20%-100% bisection bandwidth 10-100 MW of power

Typical Server CPUs:" 16-32 cores 1 GB/s NIC: 10 Gbps DRAM: 64-256 GB HDDs: 2-40 TB SSDs?

Recent Hardware Changes Node bandwidth: 1 Gbps! 10 Gbps Embrace of SSDs (flash) Almost a given for transactional workloads Soon may be competitive for capacity / $ Better I/O virtualization

Types of Applications 1. Single big computation (e.g. big data) 2. Hosted apps for many tenants (e.g. Gmail, web hosting, cloud) 3. Single big multi-user app (e.g. Facebook)

Software Stack Web Server Java, PHP, JS, Higher Interfaces Hive, Pig, HiPal, Cache memcached, TAO, Operational Store SQL, Spanner, Dynamo, Cassandra, BigTable, Other Services search, chat, ML; Unicorn, Druid, Logging Bus Kafka, Kinesis, Processing Engines MapReduce, Dryad, Pregel, Spark, Metadata Hive, Parquet, Coordination Chubby, ZK, Distributed File System GFS, Hadoop FS, Amazon S3, Resource Manager Mesos, Borg, Kubernetes, EC2,

Software Stack Data Processing Web Server Java, PHP, JS, Higher Interfaces Hive, Pig, HiPal, Cache memcached, TAO, Operational Store SQL, Spanner, Dynamo, Cassandra, BigTable, Other Services search, chat, ML; Unicorn, Druid, Logging Bus Kafka, Kinesis, Processing Engines MapReduce, Dryad, Pregel, Spark, Metadata Hive, Parquet, Coordination Chubby, ZK, Distributed File System GFS, Hadoop FS, Amazon S3, Resource Manager Mesos, Borg, Kubernetes, EC2,

Software Stack Online Apps Web Server Java, PHP, JS, Higher Interfaces Hive, Pig, HiPal, Cache memcached, TAO, Operational Store SQL, Spanner, Dynamo, Cassandra, BigTable, Other Services search, chat, ML; Unicorn, Druid, Logging Bus Kafka, Kinesis, Processing Engines MapReduce, Dryad, Pregel, Spark, Metadata Hive, Parquet, Coordination Chubby, ZK, Distributed File System GFS, Hadoop FS, Amazon S3, Resource Manager Mesos, Borg, Kubernetes, EC2,

Software Stack Offered by Cloud Services Web Server Java, PHP, JS, Higher Interfaces Hive, Pig, HiPal, Cache memcached, TAO, Operational Store SQL, Spanner, Dynamo, Cassandra, BigTable, Other Services search, chat, ML; Unicorn, Druid, Logging Bus Kafka, Kinesis, Processing Engines MapReduce, Dryad, Pregel, Spark, Metadata Hive, Parquet, Coordination Chubby, ZK, Distributed File System GFS, Hadoop FS, Amazon S3, Resource Manager Mesos, Borg, Kubernetes, EC2,

Software Stack Available as Open Source Web Server Java, PHP, JS, Higher Interfaces Hive, Pig, HiPal, Cache memcached, TAO, Operational Store MySQL, HBase, Redis, Cassandra, Mongo, Other Services search, chat, ML; ElasticSearch, Druid, Logging Bus Kafka, Flume, Processing Engines Hadoop MR, Giraph, Storm, Spark, Metadata Hive, Parquet, Coordination ZooKeeper, Distributed File System GFS, Hadoop FS, Ceph, Amazon S3, Resource Manager Mesos, Borg, Kubernetes, OpenStack,

Coverage in This Course Broad overview + a few hot areas Streaming, performance, complex analytics Focus on what people actually do Meant to drive new ideas / research

Key Themes 1. Cost of people vs software/hardware Everyone works to lower development time, operations time, and time-to-answer 2. Simple, reusable abstractions 3. Statistical effects of scale 4. Moving target of efficiency New hardware, app needs, multitenancy,

Next Week Tuesday: 3 intro readings (cloud computing and two readings from Facebook) Volunteers for these? Thursday: talk by Ali Ghodsi (Databricks) on big data processing as a service