Cloud Platforms, Challenges & Hadoop Aditee Rele Karpagam Venkataraman Janani Ravi
Cloud Platform Models Aditee Rele Microsoft Corporation Dec 8, 2010
IT Capacity Provisioning
[Figure: allocated IT capacity (a fixed cost) plotted against the load forecast and the actual load over time; static provisioning produces periods of under-supply and periods of wasted capacity, and is a barrier for innovation]
The Cloud Platform Continuum
On-Premises Servers: bring your own machines, connectivity, software, etc.; complete control; complete responsibility; static capabilities; upfront capital costs for the infrastructure
Hosted Servers: renting machines, connectivity, software; less control; fewer responsibilities; lower capital costs; more flexible; pay for fixed capacity, even if idle
Cloud Platform: shared, multi-tenant infrastructure; virtualized and dynamic; scalable and available; abstracted from the infrastructure; higher-level services; pay as you go
Legacy vs. cloud computing
Storage
Scale & High Availability
Computation & Multi-Tenancy
Automated Service Management
Types of cloud services
Infrastructure as a Service: EC2, VM Role (Azure)
Platform as a Service: Azure Compute & Storage, AppFabric, SQL; Google App Engine; Facebook
Software as a Service: Salesforce.com, Google Apps, MS Online Services
Cloud Taxonomy
The full stack: Applications, Data, Runtime, Middleware, O/S, Virtualization, Servers, Storage, Networking
On-Premises: you manage the entire stack
Infrastructure (as a Service): the vendor manages Virtualization, Servers, Storage, and Networking; you manage Applications, Data, Runtime, Middleware, and the O/S
Platform (as a Service): the vendor manages everything up through the Runtime; you manage Applications and Data
Software (as a Service): the entire stack is managed by the vendor
Traditional On-Premises Model
Servers are dedicated to specific workloads
Individual servers are sized for the peak or average capacity of a given workload
Substantial idle/wasted capacity
An application can't scale beyond the boundaries of the boxes it resides on
Provisioning new capacity takes time

Private Cloud Model
Servers are treated as a virtual pool of resources
Apps consume from the pool rather than having dedicated resources
Idle servers are automatically shut down or put to sleep until needed
Apps can scale to the available provisioned capacity in the pool
Adding a new server adds capacity to the entire pool for all apps
Dedicated infrastructure (i.e., cloud resources are accessible only to your company, not shared with others)
Microsoft Cloud Services
Challenges building cloud apps for Enterprise Janani Ravi Google Hyderabad
Traditional Enterprise applications
Desktop-based, typically single machine, single user
Collaboration may not be a primary consideration
Data stored within the Enterprise and owned by it
Performance, scalability, and security issues based on local data storage and access
Discretionary upgrades based on Enterprise needs
Enterprise responsible for backup, recovery, troubleshooting
Localized failures and support issues, usually isolated to the Enterprise
Cloud-based applications
Multi-user access, real-time collaboration, conflict resolution
Offline access: what if the user is not connected to the internet?
User interface is usually browser-based: getting all browsers to work
Latency and scalability for users at different locations
Build a developer universe
Building enterprise web applications
[Figure: the application developer designs, implements, and operates the enterprise web application; enterprise administrators configure and troubleshoot it, sending feedback and feature requests to the developer and receiving service outage info and support in return; end users use the application and give internal feedback to the administrators]
Challenges: Migration of existing data
Enterprises usually have fully provisioned users and roles, and existing applications like email, calendar, etc.
o Tools for reliable data migration
o Tools for interoperability with the older systems for partial migrations
o Use single sign-on or other methods to accept authentication from other systems
Challenges: Administrative tools
Moving to the cloud is perceived as a loss of control
o Requires good tools which allow enterprise admins to configure and manage services
o Provide access control to manage different kinds of administrators
o More transparency and monitoring tools for troubleshooting
o Logs and audit reports to track activities
Challenges: Data location and ownership
Organizations might care about where their data is stored, usually for legal reasons
o Build controls which determine where data is located
Organizations might also care about which regions the data passes through "over the wire"
o Much harder to address, since it depends on routing
Challenges: Data availability
What happens if there is a major disaster?
o Geographically distributed data centers
How often has the system been down in the last few quarters?
o Have a backup plan with multiple data centers
Do you have scheduled downtimes? How do I access my data during downtimes?
o Have good communications set up
o Provide a good offline story which is easy to use
Challenges: Data retrieval and tracking
How do administrators track suspicious activity on an account?
o Easy-to-use tools with logging and audit information to track this down
o Meta-logs with access and tracking information
Report statistics and analytics to learn how users use the applications
Monitoring to track activity and determine patterns
Challenges: Upgrades and bug fixes
Easier to fix bugs, since explicit patches are not required; however, it is also easier to make inadvertent changes
Enterprises often do not support frequent updates, so known rollout plans are needed
In conclusion
Administration, access, collaboration, etc. get easier in the cloud
Many hurdles to overcome before this becomes a reality for all enterprises
Things I've worked on
Offline capability on Docs using Google Gears
Data model and UI design on the next-generation Google word processor
Platform to manage policies for Enterprises
And previously: UI design and implementation for the IIS administrative tools
Cloud Platform Intro to Hadoop Karpagam Venkataraman Yahoo! Dec 8, 2010
Cloud Platform
Cloud platforms are foundations for building applications:
Loosely coupled
Collection of services
Semantics-free
Broadly applicable
Fault-tolerant over commodity hardware
What's in the Cloud Platform?
[Figure: cloud platform components exposed through simple web service APIs, built on shared infrastructure]
Provisioning & virtualization: fast provisioning and machine virtualization
Analytical data storage & processing
Operational storage & processing
Edge content services
Other services: messaging, workflow, virtual DBs & web serving; ID & account management; security; metering, billing, accounting; monitoring & QoS
Analytical data storage and processing is the focus of the rest of this session
What is Hadoop?
A scalable, fault-tolerant cloud operating system for big-data storage and processing
A framework that provides distributed application services
Operates on unstructured and structured data
A large and active ecosystem
Open source under the friendly Apache License
Hadoop Core Components
Hadoop Distributed File System (HDFS): distributed storage
MapReduce programming paradigm: parallel applications
Example Data Analysis Application
Find users who tend to visit good pages.
Logic: average page rank per user > 0.5

Page_Visits table:
User      URL             Time
Karishma  www.cnn.com     8:00
Anand     www.myblog.com  8:05
Karishma  www.myblog.com  10:00
Sneha     www.crap.com    10:15
Anand     www.flickr.com  12:00
Sneha     www.myblog.com  12:02
Karishma  www.crap.com    12:30

Pages table:
URL             Page Rank
www.cnn.com     0.9
www.flickr.com  0.9
www.myblog.com  0.7
www.crap.com    0.2

Page_Visits joined with Pages:
User      URL             Time   Page Rank
Karishma  www.cnn.com     8:00   0.9
Anand     www.myblog.com  8:05   0.7
Karishma  www.myblog.com  10:00  0.7
Sneha     www.crap.com    10:15  0.2
Anand     www.flickr.com  12:00  0.9
Sneha     www.myblog.com  12:02  0.7
Karishma  www.crap.com    12:30  0.2
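Before distributing this job, the expected answer can be checked with a few lines of plain Python. This is a sketch of the analysis itself, not Hadoop code; the data is transcribed from the tables above.

```python
from collections import defaultdict

# Page ranks from the Pages table
page_rank = {"www.cnn.com": 0.9, "www.flickr.com": 0.9,
             "www.myblog.com": 0.7, "www.crap.com": 0.2}

# (user, url) pairs from the Page_Visits table
visits = [("Karishma", "www.cnn.com"), ("Anand", "www.myblog.com"),
          ("Karishma", "www.myblog.com"), ("Sneha", "www.crap.com"),
          ("Anand", "www.flickr.com"), ("Sneha", "www.myblog.com"),
          ("Karishma", "www.crap.com")]

# Join each visit with its page rank, collecting ranks per user
ranks = defaultdict(list)
for user, url in visits:
    ranks[user].append(page_rank[url])

# Keep only users whose average page rank exceeds 0.5
good_users = {u: sum(r) / len(r) for u, r in ranks.items()
              if sum(r) / len(r) > 0.5}
print(good_users)  # Anand averages 0.8, Karishma 0.6; Sneha (0.45) is filtered out
```

Anand and Karishma qualify; Sneha's average of 0.45 falls below the threshold.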
Map Reduce
Divides the job into smaller tasks
Location-aware division of input
Job Tracker: schedules jobs across the Task Tracker slaves
Task Tracker: runs a data-local computation task
Each task is a map task or a reduce task
Language-independent Data Definition Language
Customizers: Combiner, Partitioner

  mapper(filename, file_contents):
      for each line in file_contents:
          fields = split(line, '\t')
          user = fields[0]
          pg_rank = fields[3]
          emit(user, pg_rank)

  reducer(user, values):
      sum = 0
      for each value in values:
          sum = sum + value
      avg_pg_rank = sum / sizeof(values)
      if avg_pg_rank > 0.5:
          emit(user, avg_pg_rank)
Hadoop - Data Flow
What happens when we submit a job?
Hadoop determines where the input data is located
Calculates the number of splits required; split size is computed as max(min(block_size, data/#maps), min_split_size)
Creates tasks
Copies the necessary files to all nodes, and each slave node runs a task
Once the map tasks are over, starts the reduce tasks and collects the output

What the user needs to specify:
Mapper class
Reducer class
Job configuration: job name, number of maps and reduces, any values required by the map and reduce classes, etc.
Build the code into a jar file and submit it.
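The split-size formula above can be sanity-checked with a small sketch. The function and variable names here are mine, not Hadoop's; sizes are in bytes.

```python
def split_size(block_size, data_size, num_maps, min_split_size):
    # max(min(block_size, data/#maps), min_split_size), as on the slide
    return max(min(block_size, data_size // num_maps), min_split_size)

MB = 1024 * 1024

# 1 GB of input, 10 requested maps, 64 MB blocks:
# data/#maps is ~102 MB, so the 64 MB block size wins
print(split_size(64 * MB, 1024 * MB, 10, 1 * MB) // MB)  # 64

# Tiny input (2 MB): the minimum split size puts a floor under the result
print(split_size(64 * MB, 2 * MB, 10, 1 * MB) // MB)  # 1
```

The min() keeps splits no larger than a block (so a map task reads local data), while the max() prevents a high requested map count from producing uselessly tiny splits.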
Example Application Data Flow
Job configuration: # of Maps = M, # of Reducers = R

Input splits ((docid, text) records; each line is user\turl\ttime\tpg_rank):
Split 1: Karishma\twww.myblog.com\t10:00\t0.7 | Sneha\twww.crap.com\t10:15\t0.2
Split i: Karishma\twww.cnn.com\t8:00\t0.9 | Anand\twww.myblog.com\t8:07\t0.7
Split M: Anand\twww.flickr.com\t12:00\t0.9 | Sneha\twww.myblog.com\t12:02\t0.7 | Karishma\twww.crap.com\t12:30\t0.2

Map phase: each split is fed to one map task (Map 1 ... Map i ... Map M) running the mapper from the previous slide, emitting (user, pg_rank) pairs, e.g. (Karishma, 0.7)
Shuffle: the framework sorts and groups the pairs by user, producing (sorted users, pg_ranks)
Reduce phase: reduce tasks (Reduce 1 ... Reduce i ... Reduce R) run the reducer, emitting (user, avg_pg_rank) only when the average exceeds 0.5, e.g. (Anand, 0.8) and (Karishma, 0.6)
Output: each reduce task writes its own file (Output File 1 ... Output File i ... Output File R)
Thank You!
References
Hadoop wiki: http://wiki.apache.org/hadoop/
Hadoop Tutorial at Yahoo!: http://developer.yahoo.com/hadoop/tutorial/module1.html
Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
Google MapReduce paper: http://labs.google.com/papers/mapreduce.html
Microsoft Dryad: http://research.microsoft.com/en-us/projects/dryad/
Appendix
HDFS
Distributes data across nodes; reliability through replication
Rack-aware; load balancing across nodes
Name Node: manages the file system metadata
Data Node: stores and serves blocks of data
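To make "rack-aware" concrete, here is a toy sketch of a replica-placement rule in the spirit of HDFS's default policy: first replica on the writer's node, second on a node in a different rack (so a whole-rack failure cannot lose all copies), third on another node in that same remote rack. The cluster layout, node names, and function are invented for illustration and are not the actual Name Node code.

```python
import random

# Hypothetical cluster: rack name -> nodes in that rack
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(writer_node, cluster, seed=0):
    rng = random.Random(seed)
    rack_of = {n: r for r, nodes in cluster.items() for n in nodes}
    local_rack = rack_of[writer_node]
    # Replica 1: on the node writing the block (fast, local write)
    replicas = [writer_node]
    # Replica 2: on a node in a different rack (survives a rack failure)
    remote_rack = rng.choice([r for r in cluster if r != local_rack])
    replicas.append(rng.choice(cluster[remote_rack]))
    # Replica 3: on a different node in that same remote rack
    replicas.append(rng.choice([n for n in cluster[remote_rack]
                                if n != replicas[1]]))
    return replicas

r = place_replicas("node1", cluster)
print(r)  # e.g. ['node1', 'node5', 'node4']: three distinct nodes spanning two racks
```

The result always spans two racks and three distinct nodes, which is the balance HDFS strikes between write bandwidth (only one cross-rack transfer) and fault tolerance.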