Large-scale Data Processing on the Cloud




Large-scale Data Processing on the Cloud
MTAT.08.036
Lecture 1: Data analytics in the cloud
Satish Srirama
satish.srirama@ut.ee

Course Purpose
- Introduce cloud computing concepts
- Introduce data analytics in the cloud
- Introduce distributed computing algorithms like MapReduce
- BigData solutions like Pig and Hive
- NoSQL
- A glance at research at the Mobile & Cloud Lab in the cloud computing domain
http://courses.cs.ut.ee/2015/ldpc/
9/9/2015 Satish Srirama 2/45

Course Schedule
- Lectures (Weeks 2-9): Wed. 10.15-12.00 (J. Liivi 2-512)
- Practice sessions (Weeks 3-10): Tue. 10.15-12.00 (J. Liivi 2-003)

Related Courses
- MTAT.08.027 Basics of Cloud Computing (3 ECTS): Spring 2016
- MTAT.03.280 Mobile and Cloud Computing Seminar (3 ECTS): Wed. 08.15-10.00, J. Liivi 2-512
- MTAT.03.266 Mobile Application Development Projects (3 ECTS): Tue. 10.15-12.00, J. Liivi 2-512

Questions
- Is everyone comfortable with data structures?
- How comfortable are you with algorithms?
- How comfortable are you with programming?
  - Java?
  - External APIs?
  - Web programming

Grading
- Written exam: 50%
- Labs: 45%
  - 7+ lab exercises
  - For a full score, submit lab exercise solutions by Monday night of the following week
  - Submission one week late: 80%
  - Submission two weeks late: 50%
  - Hard deadlines will be announced later
- Active participation in the lectures: max 5%
- To pass the course
  - You need to score at least 50% in each of the subsections
  - You need to score at least 50% in total

Course schedule
- 09.09 Data analytics in the cloud
- 16.09 MapReduce
- 23.09 MapReduce algorithms
- 30.09 Information Retrieval with MapReduce
- 07.10 Pig
- 14.10 Hive
- 21.10 Spark
- 28.10 NoSQL
- 04.11 Examination 1

Course schedule - continued
Labs
- 15.09 Starting with a cloud
- 22.09 MapReduce - Basics
- 29.09 Data analysis with MapReduce
- 06.10 Information Retrieval with MapReduce
- 13.10 Working with Pig
- 20.10 Working with Hive
- 27.10 Spark ++ NoSQL

Reference Books
- Rajkumar Buyya, Christian Vecchiola, S. Thamarai Selvi. Mastering Cloud Computing: Foundations and Applications Programming.
- Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. http://lintool.github.io/mapreducealgorithms/MapReduce-book-final.pdf
- Tom White. Hadoop: The Definitive Guide. O'Reilly, 2012.

Reference Papers
- M. Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing, Technical Report, University of California, Feb. 2009.
- Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

Lecture 1: DATA ANALYTICS IN THE CLOUD

WHAT IS CLOUD COMPUTING?
- It's nothing new: "...we've redefined Cloud Computing to include everything that we already do... I don't understand what we would do differently... other than change the wording of some of our ads." Larry Ellison, CEO, Oracle (Wall Street Journal, Sept. 26, 2008)
- It's a trap: "It's worse than stupidity: it's marketing hype. Somebody is saying this is inevitable and whenever you hear that, it's very likely to be a set of businesses campaigning to make it true." Richard Stallman, Founder, Free Software Foundation (The Guardian, Sept. 29, 2008)
- No consistent answer! Everyone thinks it is something else.
Slide taken from Professor Anthony D. Joseph's lecture at RWTH Aachen

What is Cloud Computing?
- Computing as a utility
  - Utility services, e.g. water, electricity, gas, etc.
  - Consumers pay based on their usage

Timeline (figure)

Clouds - Why Now (not then)?
- Experience with very large datacenters
  - Unprecedented economies of scale
  - Transfer of risk
- Technology factors
  - Pervasive broadband Internet
  - Maturity in virtualization technology
- Business factors
  - Minimal capital expenditure
  - Pay-as-you-go billing model

Virtualization
- Virtualization techniques are the basis of cloud computing
- Virtualization technologies partition hardware and thus provide flexible and scalable computing platforms
- Virtual machine techniques
  - VMware and Xen
  - OpenNebula
  - Amazon EC2
Figure: virtualized stack (Hardware, Hypervisor, several OS instances, one App on each OS)

Cloud Computing - Characteristics
- Illusion of infinite resources
- No up-front cost
- Fine-grained billing (e.g. hourly)
- Gartner: "Cloud computing is a style of computing where massively scalable IT-related capabilities are provided as a service across the Internet to multiple external customers"

Cloud Computing - Services
- Software as a Service (SaaS)
  - A way to access applications hosted on the web through your web browser
  - Examples: Facebook, Flickr, Myspace.com, Google Maps API, Gmail
- Platform as a Service (PaaS)
  - Provides a computing platform and a solution stack (e.g. LAMP) as a service
  - Examples: Google App Engine, Force.com, Hadoop, Azure, Heroku, etc.
- Infrastructure as a Service (IaaS)
  - Use of commodity computers, distributed across the Internet, to perform parallel processing, distributed storage, indexing and mining of data
  - Virtualization
  - Examples: Amazon EC2, Rackspace, GoGrid, SciCloud, etc.
Figure: the three service models ordered by level of abstraction

Cloud Computing - Themes
- Massively scalable
- On-demand & dynamic
- Only use what you need (elastic)
  - No upfront commitments, use on a short-term basis
- Accessible via Internet, location independent
- Transparent
  - Complexity concealed from users, virtualized, abstracted
- Service oriented
  - Easy to use SLAs
SLA: Service Level Agreement

Cloud Models
- Internal (private) cloud
  - Cloud within an organization
- Community cloud
  - Cloud infrastructure jointly owned by several organizations
- Public cloud
  - Cloud infrastructure owned by an organization, provided to the general public as a service
- Hybrid cloud
  - Composition of two or more cloud models

Short Term Implications of Clouds
- Startups and prototyping
  - Minimize infrastructure risk
  - Lower cost of entry
- Batch jobs
- One-off tasks
  - Washington Post, NY Times
- Cost associativity for scientific applications
- Research at scale

Cloud Application Demand
- Many cloud applications have cyclical demand curves
  - Daily, weekly, monthly, ...
- Workload spikes are more frequent and significant
  - When some event happens, such as the death of a pop star, the number of tweets and Wikipedia traffic increase
  - 22% of tweets and 20% of Wikipedia traffic when Michael Jackson died in 2009
  - Google thought it was under attack
Figure: resource demand over time

Economics of Cloud Users
- Pay by use instead of provisioning for peak
Figure: resources (capacity and demand) over time for a static data center and for a data center in the cloud; the gap between capacity and demand is marked as unused resources

Economics of Cloud Users - continued
- Risk of over-provisioning: underutilization
  - Huge sunk cost in infrastructure
Figure: static data center, resources over time; capacity stays above demand, leaving unused resources

Economics of Cloud Users - continued
- Heavy penalty for under-provisioning
Figure: three panels of resources (capacity and demand) over time in days (1, 2, 3); the panels illustrate lost revenue and lost users
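The trade-off on the three economics slides can be sketched with a few lines of arithmetic. This is only an illustration: the hourly demand series, the price of 10 cents per server-hour, and the function names are all made-up assumptions, not figures from the lecture.

```python
from typing import Sequence


def static_cost(demand: Sequence[int], capacity: int, price_cents: int) -> int:
    """Cost of a fixed-size data center: every server is paid for every hour."""
    return capacity * len(demand) * price_cents


def pay_per_use_cost(demand: Sequence[int], price_cents: int) -> int:
    """Cost when capacity follows demand and only used server-hours are billed."""
    return sum(demand) * price_cents


def unmet_demand(demand: Sequence[int], capacity: int) -> int:
    """Server-hours of demand that an under-provisioned data center cannot serve."""
    return sum(max(0, d - capacity) for d in demand)


# Hypothetical demand in servers needed per hour, with one spike.
demand = [2, 2, 3, 8, 10, 4, 2, 2]
price_cents = 10  # assumed price per server-hour
peak = max(demand)

print("provisioned for peak:", static_cost(demand, peak, price_cents), "cents")
print("pay by use:", pay_per_use_cost(demand, price_cents), "cents")
print("utilization at peak capacity:", sum(demand) / (peak * len(demand)))
print("unmet server-hours with capacity 5:", unmet_demand(demand, 5))
```

Provisioning for the peak pays for idle servers most of the time, while provisioning below the peak leaves demand unserved during the spike, which is where the slides place lost revenue and lost users.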

Long Term Implications of Clouds
- Application software: cloud & client parts, disconnection tolerance
- Infrastructure software: resource accounting, VM awareness
- Hardware systems: containers, energy proportionality

Cloud Computing Progress (figure; Armando Fox, 2010)

Our Data-driven World
- Already as of 2012, 2.5 exabytes (2.5 × 10^18 bytes) of data were created every day [https://en.wikipedia.org/wiki/big_data]
- Science
  - Databases from astronomy, genomics, environmental data, ...
- Sensors and smart devices
- Humanities and social sciences
  - Scanned books, historical documents, social interactions data, ...
- Business & commerce
  - Corporate sales, stock market transactions, census, airline traffic, ...
- Entertainment
  - Internet images, MP3 files, ...
- Medicine
  - MRI & CT scans, patient records, ...
[Agrawal et al, EDBT 2011 Tutorial]

What can we do with this wealth?
- What can we do?
  - Scientific breakthroughs
  - Business process efficiencies
  - Realistic special effects
  - Improve quality of life: healthcare, transportation, environmental disasters, daily life, ...
- Could we do more?
  - YES: but we need major advances in our capability to analyze this data
[Agrawal et al, EDBT 2011 Tutorial]

Platforms for Data Analysis
- Data Warehousing, Data Analytics & Decision Support Systems
- Data Analytics in the Web Context
- Data Analytics in the Cloud

Data Warehousing, Data Analytics & Decision Support Systems
- Used to manage and control business
- Transactional data: historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad hoc
- Used by managers and analysts to understand the business and make judgments
- Limited scalability
[Agrawal et al, EDBT 2011 Tutorial]

Information Retrieval & Data Analytics in the Web Context (Lecture 4)
Figure: information retrieval cycle
- Stages: Source Selection -> Resource -> Query Formulation -> Query -> Search -> Results -> Selection -> Documents -> Examination -> Information -> Delivery
- Feedback loops: system discovery, vocabulary discovery, concept discovery, document discovery, source reselection

Data Analytics in the Cloud
- Scalability to large data volumes:
  - Scan 100 TB on 1 node @ 50 MB/sec = 23 days
  - Scan on 1000-node cluster = 33 minutes
  - Divide-and-conquer (i.e., data partitioning)
  - Replicate data and computation
- Cost-efficiency:
  - Commodity nodes (cheap, but unreliable)
  - Commodity network
  - Automatic fault-tolerance
This is where MapReduce jumps in (Lectures 2, 3, 4, 7)
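The scan-time figures on this slide can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even partitioning with no coordination overhead; the function name is illustrative only.

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6


def scan_seconds(data_bytes: int, bytes_per_sec_per_node: int, nodes: int = 1) -> float:
    """Time to scan the data when it is split evenly across `nodes` machines."""
    return data_bytes / (bytes_per_sec_per_node * nodes)


single_node = scan_seconds(100 * TB, 50 * MB)
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)

print(f"1 node: {single_node / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

The thousandfold speed-up holds only under the ideal-partitioning assumption; real clusters lose some of it to stragglers, failures, and data movement.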

Big Data Solutions
- We mainly study Hadoop MapReduce
  - Hadoop MapReduce is one of the most widely used solutions for large-scale data analysis
- Spark will be discussed as an alternative (Lecture 7)
  - In-memory MapReduce
- However, Hadoop has some troubles
  - Writing low-level MapReduce code is slow
  - A lot of expertise is needed to optimize MapReduce code
  - Prototyping requires compilation
  - A lot of custom code is required
  - Complex MapReduce job chains are hard to manage
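As a preview of the programming model, here is a minimal single-process sketch of the map, shuffle, and reduce phases for word counting. It is plain Python, not the Hadoop API; the function names are illustrative assumptions, and a real framework would run the map and reduce calls on many nodes in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["the cloud is elastic", "the cloud scales"]))
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be partitioned across machines without changing the result.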

Pig & Hive (Lectures 5 & 6)
- Pig is a high-level language on top of Hadoop MapReduce
  - Models a scripting language (Pig Latin)
  - Fast prototyping
  - Similar to declarative SQL
  - Easier to get started
  - In comparison to Hadoop MapReduce: 5% of the code, 5% of the time
- Hive
  - A data warehouse infrastructure that provides data summarization and ad hoc querying
  - Developed by Facebook

Scaling Information Systems
- Fault tolerance, high availability & scalability are essential prerequisites for any enterprise application deployment
- Scalability
  - Generally, nodes in information systems support a specific load
  - When the load increases beyond a certain level, systems should be scaled up
  - Similarly, when the load decreases, they should be scaled down

Scaling Information Systems - continued
- Two basic models of scaling
- Vertical scaling
  - Also known as scale-up
  - Achieving better performance by replacing an existing node with a much more powerful machine
  - Risk of losing currently running jobs
- Horizontal scaling
  - Also known as scale-out
  - Achieving better performance by adding more nodes to the system
  - New servers are introduced to the system to run along with the existing servers
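A horizontal scaling decision can be sketched as a simple sizing rule: given the current load and the load one node can support, compute how many nodes should be running. The function name, the request rates, and the one-node minimum below are illustrative assumptions, not part of the lecture.

```python
import math


def nodes_needed(load: float, capacity_per_node: float, min_nodes: int = 1) -> int:
    """Number of identical nodes required to serve `load` (scale-out sizing)."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    return max(min_nodes, math.ceil(load / capacity_per_node))


# Hypothetical request rates over time, each node handling 100 requests/sec.
for load in (80, 250, 900, 120):
    print(load, "req/s ->", nodes_needed(load, 100), "node(s)")
```

Re-evaluating this rule periodically adds servers when load rises and removes them when it falls, without replacing any running machine as vertical scaling would.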

Scaling Enterprise Applications in the Cloud
Figure: client sites -> Load Balancer (Proxy) -> multiple App Servers -> DB and Memcache

Load Balancer
- Load balancing has been a key mechanism in making efficient web server farms
- A load balancer automatically distributes incoming application traffic across multiple servers
- Some of the key load balancing algorithms include random, round robin, and least connection
- Examples:
  - Nginx - http://nginx.org/
  - HAProxy - http://haproxy.1wt.eu/
  - Pen - http://siag.nu/pen/
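The three algorithms named on this slide can be sketched in a few lines each. This is a toy illustration with hypothetical class and server names; it is not how Nginx, HAProxy, or Pen implement them.

```python
import itertools
import random
from typing import Dict, Sequence


def pick_random(servers: Sequence[str]) -> str:
    """Random: choose any server with equal probability."""
    return random.choice(list(servers))


class RoundRobin:
    """Round robin: hand out servers in a fixed rotating order."""

    def __init__(self, servers: Sequence[str]) -> None:
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    """Least connection: choose the server with the fewest open connections."""

    def __init__(self, servers: Sequence[str]) -> None:
        self.active: Dict[str, int] = {s: 0 for s in servers}

    def pick(self) -> str:
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Call when a connection to `server` is closed."""
        self.active[server] -= 1


rr = RoundRobin(["app1", "app2", "app3"])
print([rr.pick() for _ in range(4)])
```

Round robin ignores how busy each server is, while least connection needs the balancer to track open connections, which is the extra state kept in `active`.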

Scaling in the Cloud - bottleneck
Figure: client sites -> Load Balancer (Proxy) -> multiple App Servers -> DB and Memcache
- The database becomes the scalability bottleneck
- Cannot leverage elasticity

Scaling in the Cloud - bottleneck
Figure: client sites -> Load Balancer (Proxy) -> multiple App Servers -> Key Value Stores and Memcache
- Key value stores (NoSQL, Lecture 8) replace the DB
- Scalable and elastic, but limited consistency and operational flexibility

This week in Lab (Homework)
- Registration to the cloud & keys
- Working with Eucatools and API
- Some small programming task to get you started

Next lecture
- Introduction to MapReduce

References
- Several of the slides are taken from Prof. Anthony D. Joseph's lecture at RWTH Aachen (March 2010)
- Some slides are also taken from Big Data and Cloud Computing: Current State and Future Opportunities, tutorial at EDBT 2011 by D. Agrawal, S. Das and A. Abbadi.
- M. Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing, Technical Report, University of California, Feb. 2009.