Cloud Platforms, Challenges & Hadoop. Aditee Rele, Karpagam Venkataraman, Janani Ravi



Cloud Platform Models. Aditee Rele, Microsoft Corporation, Dec 8, 2010

Provisioning IT Capacity
[Chart: allocated IT capacity (fixed-cost step line) vs. actual load and load forecast over time. Allocated capacity alternates between under-supply and wasted capacity, and its fixed cost is a barrier for innovations.]

The Cloud Platform Continuum
o On-Premises Servers: bring your own machines, connectivity, software, etc. Complete control, complete responsibility. Static capabilities; upfront capital costs for the infrastructure.
o Hosted Servers: renting machines, connectivity, software. Less control, fewer responsibilities. Lower capital costs, more flexible; pay for fixed capacity, even if idle.
o Cloud Platform: shared, multi-tenant infrastructure. Virtualized and dynamic; scalable and available. Abstracted from the infrastructure; higher-level services; pay as you go.

Legacy vs. cloud computing
o Storage Scale & High Availability
o Computation & Multi-Tenancy
o Automated Service Management

Types of cloud services
o Infrastructure as a Service: EC2; VM Role (Azure)
o Platform as a Service: Azure Compute & Storage, AppFabric, SQL; Google App Engine; Salesforce.com
o Software as a Service: Google Apps; MS Online Services; Facebook

Cloud Taxonomy
The full stack: Applications, Data, Runtime, Middleware, O/S, Virtualization, Servers, Storage, Networking.
o On-Premises: you manage the entire stack.
o Infrastructure as a Service: the vendor manages Virtualization, Servers, Storage, and Networking; you manage the rest.
o Platform as a Service: the vendor manages everything up through the Runtime; you manage Applications and Data.
o Software as a Service: the vendor manages the entire stack.

Traditional On-Premises Model
o Servers are dedicated to specific workloads
o Individual servers sized for peak or average capacity of a given workload
o Substantial idle/wasted capacity
o An application can't scale beyond the boundaries of the boxes it resides on
o Provisioning new capacity takes time
Private Cloud Model
o Servers are treated as a virtual pool of resources
o Apps consume from the pool rather than having dedicated resources
o Idle servers are automatically shut down or put to sleep until needed
o Apps can scale to the available provisioned capacity in the pool
o Adding a new server adds capacity to the entire pool for all apps
o Dedicated infrastructure (i.e., cloud resources are accessible only to your company, not shared with others)

Microsoft Cloud Services

Challenges building cloud apps for the Enterprise. Janani Ravi, Google Hyderabad

Traditional Enterprise applications
o Desktop-based; typically single machine, single user
o Collaboration may not be a primary consideration
o Data stored within the Enterprise and owned by it
o Performance, scalability, and security issues based on local data storage and access
o Discretionary upgrades based on Enterprise needs
o Enterprise responsible for backup, recovery, and troubleshooting
o Localized failures and support issues, usually isolated to the Enterprise

Cloud-based applications
o Multi-user access, realtime collaboration, conflict resolution
o Offline access: what if the user is not connected to the internet?
o User interface is usually browser-based: getting all browsers to work
o Latency and scalability for users at different locations
o Build a developer universe

Building enterprise web applications
[Diagram: the application developer designs, implements, and operates enterprise web applications, and receives feedback and feature requests; enterprise administrators configure and troubleshoot the applications and receive service outage info and support; end users use the applications and provide internal feedback.]

Challenges: Migration of existing data
Enterprises usually have fully provisioned users and roles, and existing applications like email, calendar, etc.
o Tools for reliable data migration
o Tools for interoperability with the older systems for partial migrations
o Use single sign-on or other methods to accept authentication from other systems

Challenges: Administrative tools
A move to the cloud is perceived as a loss of control.
o Requires good tools which allow enterprise admins to configure and manage services
o Provide access control to manage different kinds of administrators
o More transparency and monitoring tools for troubleshooting
o Logs and audit reports to track activities

Challenges: Data location and ownership
Organizations might care about where their data is stored, usually for legal reasons.
o Build controls which determine where data is located
Organizations might care about which regions the data passes through "over the wire".
o Much harder to address, since it depends on network routing

Challenges: Data availability
What happens if there is a major disaster?
o Geographically distributed data centers
How often has the system been down in the last few quarters?
o Have a backup plan with multiple data centers
Do you have scheduled downtimes? How do I access my data during downtimes?
o Have good communications set up
o Provide a good offline story which is easy to use

Challenges: Data retrieval and tracking
How do administrators track suspicious activity on an account?
o Easy-to-use tools with logging and audit information to track this down
o Meta logs with access and tracking information
o Report statistics and analytics to know how users use the applications
o Monitoring to track activity to determine patterns

Challenges: Upgrades and bug fixes
o Easier to fix bugs, since explicit patches are not required; however, it is also easier to make inadvertent changes
o Enterprises often do not support frequent updates; rollouts need to follow known plans

In conclusion
o Administration, access, collaboration, etc. get easier in the cloud
o Many hurdles to overcome before this becomes a reality for all enterprises

Things I've worked on
o Offline capability in Google Docs using Google Gears
o Data model and UI design for the next-generation Google word processor
o Platform to manage policies for Enterprises
o And previously, UI design and implementation for the IIS administrative tools

Cloud Platform: Intro to Hadoop. Karpagam Venkataraman, Yahoo!, Dec 8, 2010

Cloud Platform
Cloud platforms are foundations for building applications:
o Loosely coupled
o Collection of services
o Semantics-free
o Broadly applicable
o Fault-tolerant over commodity hardware

What's in the Cloud Platform?
[Diagram: simple Web Service APIs on top of a cloud platform comprising provisioning & virtualization, analytical data storage & processing, operational storage & processing, edge content services, and other services (messaging, workflow, virtual DBs & web serving), together with ID & account management, security, metering/billing/accounting, and monitoring & QoS, all on a shared infrastructure with fast provisioning and machine virtualization.]
The rest of this session covers analytical data storage and processing.

What is Hadoop?
o A scalable, fault-tolerant cloud operating system for big-data storage and processing
o A framework that provides distributed application services
o Operates on unstructured and structured data
o A large and active ecosystem
o Open source under the friendly Apache License

Hadoop Core Components
o Hadoop Distributed File System (HDFS): distributed storage
o MapReduce programming paradigm: parallel applications

Example Data Analysis Application
Find users who tend to visit good pages. Logic: average page rank per user > 0.5.

Page_Visits table (User, URL, Time):
Karishma  www.cnn.com     8:00
Anand     www.myblog.com  8:05
Karishma  www.myblog.com  10:00
Sneha     www.crap.com    10:15
Anand     www.flickr.com  12:00
Sneha     www.myblog.com  12:02
Karishma  www.crap.com    12:30

Pages table (URL, Page Rank):
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2

Joining the two tables gives each visit's page rank: 0.9, 0.7, 0.7, 0.2, 0.9, 0.7, 0.2.
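The expected answer can be worked out with a few lines of plain Python (an illustrative sketch, not part of the original deck; the visit and page-rank data come from the tables above):

```python
from collections import defaultdict

# Page_Visits table as (user, URL) pairs; times don't affect the result
visits = [
    ("Karishma", "www.cnn.com"), ("Anand", "www.myblog.com"),
    ("Karishma", "www.myblog.com"), ("Sneha", "www.crap.com"),
    ("Anand", "www.flickr.com"), ("Sneha", "www.myblog.com"),
    ("Karishma", "www.crap.com"),
]
# Pages table: URL -> page rank
rank = {"www.cnn.com": 0.9, "www.flickr.com": 0.9,
        "www.myblog.com": 0.7, "www.crap.com": 0.2}

# Join visits with page ranks and group by user
ranks_by_user = defaultdict(list)
for user, url in visits:
    ranks_by_user[user].append(rank[url])

# Keep users whose average page rank exceeds 0.5
good_users = {user: round(sum(r) / len(r), 2)
              for user, r in ranks_by_user.items()
              if sum(r) / len(r) > 0.5}
print(good_users)  # {'Karishma': 0.6, 'Anand': 0.8}
```

Sneha's average (0.45) falls below the 0.5 threshold, so only Karishma and Anand appear in the output.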

Map Reduce
o Divides the job into smaller tasks
o Location-aware division of input
o Job Tracker: schedules jobs across Task Tracker slaves
o Task Tracker: runs data-local computation tasks
o Each task is a map task or a reduce task
o Language independent
o Data Definition Language
o Customizers: Combiner, Partitioner

Pseudocode for the page-rank example (fields(0) is the user column, fields(3) the page rank):

mapper (filename, file-contents):
    for each line in file-contents:
        fields = split(line, '\t')
        user = fields(0)
        pg_rank = fields(3)
        emit (user, pg_rank)

reducer (user, values):
    sum = 0
    for each value in values:
        sum = sum + value
    avg_pg_rank = sum / sizeof(values)
    if avg_pg_rank > 0.5:
        emit (user, avg_pg_rank)
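The map/shuffle/reduce structure of the pseudocode can be simulated in a single Python process (an illustrative sketch, not Hadoop API code; the tab-separated input lines follow the split format used in the example):

```python
from collections import defaultdict

# One input split: user \t URL \t time \t page-rank per line
lines = [
    "Karishma\twww.cnn.com\t8:00\t0.9",
    "Anand\twww.myblog.com\t8:05\t0.7",
    "Karishma\twww.myblog.com\t10:00\t0.7",
    "Sneha\twww.crap.com\t10:15\t0.2",
    "Anand\twww.flickr.com\t12:00\t0.9",
    "Sneha\twww.myblog.com\t12:02\t0.7",
    "Karishma\twww.crap.com\t12:30\t0.2",
]

def mapper(line):
    fields = line.split("\t")
    yield fields[0], float(fields[3])      # emit (user, pg_rank)

def reducer(user, values):
    avg_pg_rank = sum(values) / len(values)
    if avg_pg_rank > 0.5:
        yield user, round(avg_pg_rank, 2)  # emit (user, avg_pg_rank)

# Shuffle: group mapper output by key, then process keys in sorted order
groups = defaultdict(list)
for line in lines:
    for user, pg_rank in mapper(line):
        groups[user].append(pg_rank)

output = [kv for user in sorted(groups) for kv in reducer(user, groups[user])]
print(output)  # [('Anand', 0.8), ('Karishma', 0.6)]
```

In real Hadoop the shuffle is performed by the framework between the map and reduce tasks; here it is just a dictionary and a sort.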

Hadoop - Data Flow
What happens when we submit a job?
o Hadoop determines where the input data is located
o Calculates the number of splits required. Split size is computed as max(min(block_size, data/#maps), min_split_size)
o Creates tasks
o Copies necessary files to all nodes; each slave node runs a task
o Once map tasks are over, starts reduce tasks
o Collects output
What the user needs to specify:
o Mapper class
o Reducer class
o Job configuration: job name, number of maps and reduces, any values required by the map and reduce classes, etc.
o Build the code into a jar file and submit
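The split-size formula above can be checked with a quick calculation (the concrete numbers below — a 64 MB block size, 1 GB of input, 10 requested maps, a 1 MB minimum split — are illustrative assumptions, not values from the slide):

```python
def split_size(block_size, data_size, num_maps, min_split_size):
    # Split size = max(min(block_size, data / #maps), min_split_size)
    return max(min(block_size, data_size // num_maps), min_split_size)

MB = 1024 * 1024
# 1 GB of input, 64 MB blocks, 10 requested maps, 1 MB minimum split:
# data/#maps is ~102 MB, so the split is capped at the 64 MB block size
print(split_size(64 * MB, 1024 * MB, 10, 1 * MB) // MB)  # 64
```

With very small inputs or many maps, the min_split_size floor wins instead, preventing Hadoop from creating tiny, inefficient splits.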

Example Application Data Flow
Job configuration: number of maps = M, number of reducers = R.
The input is divided into splits 1..M of tab-separated (user, URL, time, page-rank) lines, e.g. Karishma\twww.cnn.com\t8:00\t0.9. Each split feeds one map task (Map 1 .. Map M), which runs the mapper and emits (user, pg_rank) pairs, e.g. Karishma, 0.7. The shuffle phase sorts and groups the pairs by user and distributes them to the reduce tasks (Reduce 1 .. Reduce R), which run the reducer to compute each user's average page rank, emit (user, avg_pg_rank) for averages above 0.5, and write output files 1..R, e.g. Anand, 0.8 and Karishma, 0.6.

Thank You!
References
o Hadoop wiki: http://wiki.apache.org/hadoop/
o Hadoop Tutorial at Yahoo!: http://developer.yahoo.com/hadoop/tutorial/module1.html
o Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
o Google MapReduce paper: http://labs.google.com/papers/mapreduce.html
o Microsoft Dryad: http://research.microsoft.com/en-us/projects/dryad/

Appendix

HDFS
o Distributes data across nodes; reliability through replication
o Rack-aware; load balancing across nodes
o Name Node: manages the file system metadata
o Data Node: stores and serves blocks of data
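As a sketch of the rack-aware replication mentioned above: with the default replication factor of 3, HDFS places one replica on the writer's rack and two on a single remote rack. The code below is an illustrative simplification of that policy, not the actual NameNode logic:

```python
import random

def choose_replica_nodes(racks, writer_rack):
    """Simplified default HDFS placement for replication factor 3:
    first replica on the writer's rack, second on a different rack,
    third on the same rack as the second."""
    first = random.choice(racks[writer_rack])
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
print(choose_replica_nodes(racks, "rack1"))  # e.g. ['node1', 'node4', 'node3']
```

Keeping two replicas on one remote rack bounds cross-rack write traffic while still surviving the loss of an entire rack.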