A View of Cloud Computing: Concepts and Challenges
Jorge G. Barbosa
Universidade do Porto, Faculdade de Engenharia, LIACC, Porto, Portugal
jbarbosa@fe.up.pt
FEUP, 2013

Outline
Part I: Basic Concepts - Introduction and Principles Overview
Part II: Challenges - Fault Tolerance, Energy Optimization, Quality of Service (QoS)
Part III: Current Research
What is Cloud Computing?
"Cloud Computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services."
Fox, Armando, et al. "Above the clouds: A Berkeley view of cloud computing." Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS 28 (2009).

"A large-scale distributed computing paradigm in which a pool of abstracted, virtualized, dynamically-scalable, managed computing power, storage, platforms, and services are delivered on demand over the Internet."
Foster, Ian, et al. "Cloud computing and grid computing 360-degree compared." Grid Computing Environments Workshop, GCE'08. IEEE, 2008.
Clouds / Cloud Computing
(Figure omitted.) Image source: The Future of Cloud Computing, available at http://cordis.europa.eu/fp7/ict/ssai/docs/cloud-report-final.pdf
TYPES
SaaS (Software as a Service) - What: on-demand access to any application. Who: end-user (consume).
PaaS (Platform as a Service) - What: a platform upon which apps/services can be developed and hosted. Who: developer (build).
IaaS (Infrastructure as a Service) - What: access to computational resources, i.e. CPU, RAM, data and storage. Who: hosting provider (host).

MODES
Private: usually owned by an institution; functionalities not directly exposed to the consumer (ex.: eBay).
Hybrid: mixed employment of private and public infrastructures, so as to reduce costs by sharing, but with the desired degree of control.
Public: owners offer their services to users outside of the institution (ex.: Amazon, Google Apps).
Image source: http://www.iland.com
FEATURES
Elasticity - leveraged by self-* properties; provides agility and adaptability to environment changes; implies horizontal and vertical scalability.
Reliability and Availability - ensures constant operation through redundant resource usage (ex.: fault tolerance); ability to deal with increasing concurrent access (ex.: load balancing).

BENEFITS
Quality of Service - support and maintenance of specified users' requirements to be met by the services and/or resources (ex.: response time).
Pay per use - services sold as Utility Computing; costs according to the actual consumption of resources.
Going Green - reduces the additional costs of energy consumption, but also the carbon footprint.
Virtualization Technology in Clouds
Virtualization is an essential technology in the Cloud: it underpins the cloud features (e.g. ease of use, flexibility and adaptability, location independence, etc.).
Image source: http://blog.cloudpassage.com
Hot Topics in Cloud Research
Fault tolerance - business continuity and service availability.
Energy efficiency - optimize energy consumption (ex.: maximize MFLOP/Joule); green cloud computing - minimize operational costs but also reduce the environmental impact.
Quality of Service - performance unpredictability (ex.: due to sharing of resources among co-located VMs).
Security - data security.
Interoperability - how do different clouds cooperate?
Standardization - how to guarantee that a user can change the cloud provider?
Autonomic Computing.
Fault Tolerance
Dependability of the infrastructure: distributed systems are growing in scale and in complexity. The Mean Time Between Failures (MTBF) would be 1.25 h on a petaflop system (1).
(1) Fu, S. "Failure-aware resource management for high-availability computing clusters with distributed virtual machines." Journal of Parallel and Distributed Computing 70.4 (2010): 384-393.

Proactive fault tolerance
Intelligent Platform Management Interface (IPMI) for health inquiries (migration starts on threshold violations); Ganglia to determine target nodes based on load averages.
In proactive FT systems, processes automatically migrate from unhealthy nodes to healthy ones. In reactive schemes, recovery occurs in response to failures that have already occurred.
(Figure: overall architecture.)
Nagarajan, A., et al. "Proactive fault tolerance for HPC with Xen virtualization." Proc. of the 21st Annual International Conference on Supercomputing. ACM, 2007.
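The proactive loop above can be sketched as follows. This is a minimal illustration, not the Nagarajan et al. implementation: the health metrics, thresholds, and node/VM record fields are all hypothetical stand-ins for what IPMI and Ganglia would report.

```python
# Sketch of a proactive fault-tolerance loop: node health metrics (as IPMI
# might report them) are polled, and VMs are migrated away from any node
# whose metric violates a threshold, towards the least-loaded healthy node
# (as Ganglia load averages would indicate). All names are illustrative.

HEALTH_THRESHOLDS = {"cpu_temp_c": 85.0, "fan_rpm_min": 1500.0}

def unhealthy(node):
    """A node is unhealthy if it is too hot or a fan spins too slowly."""
    return (node["cpu_temp_c"] > HEALTH_THRESHOLDS["cpu_temp_c"]
            or node["fan_rpm"] < HEALTH_THRESHOLDS["fan_rpm_min"])

def pick_target(nodes, exclude):
    """Choose the healthy node with the lowest load average."""
    candidates = [n for n in nodes if n["name"] != exclude and not unhealthy(n)]
    return min(candidates, key=lambda n: n["load_avg"], default=None)

def plan_migrations(nodes):
    """Return (vm, source, target) tuples for VMs hosted on unhealthy nodes."""
    plan = []
    for node in nodes:
        if unhealthy(node):
            for vm in node["vms"]:
                target = pick_target(nodes, exclude=node["name"])
                if target is not None:
                    plan.append((vm, node["name"], target["name"]))
    return plan
```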
Fault Tolerance
Dynamic allocation of VMs, considering PM reliability. Based on a failure-predictor tool with 75% average accuracy (1).
Optimistic Best-Fit (OBFIT) algorithm - selects the PM with minimum weighted available capacity and reliability (1).
Pessimistic Best-Fit (PBFIT) algorithm - calculates the average capacity C_avg of the reliable PMs, then selects the unreliable PM p with capacity C_p such that C_avg + C_p results in the minimum necessary capacity.
(Figure: proposed architecture for reconfigurable distributed virtual machines.)
(1) Fu, S. "Failure-aware resource management for high-availability computing clusters with distributed virtual machines." Journal of Parallel and Distributed Computing 70.4 (2010): 384-393.

System productivity is enhanced by the proposed strategies: the task completion rate reaches 91.7% with 83.6% utilization of relatively unreliable nodes.
(Figures: percentage of completed jobs; percentage of completed tasks.)
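The two selection rules can be sketched as below. This is an illustration of the idea only: the weighting between spare capacity and reliability, and the reliability threshold, are assumptions; Fu (2010) defines the exact scoring, which is not reproduced here.

```python
# Illustrative sketch of the OBFIT/PBFIT selection rules. PMs are dicts with
# "capacity" (spare processing power) and "reliability" (predicted, in [0,1]).
# The weight w and rel_threshold are hypothetical parameters.

def obfit(pms, demand, rel_threshold=0.75, w=0.5):
    """Among reliable PMs that can host the VM, pick the one minimizing a
    weighted combination of leftover capacity and (1 - reliability)."""
    feasible = [p for p in pms
                if p["reliability"] >= rel_threshold and p["capacity"] >= demand]
    if not feasible:
        return None
    score = lambda p: w * (p["capacity"] - demand) + (1 - w) * (1 - p["reliability"])
    return min(feasible, key=score)

def pbfit(pms, demand, rel_threshold=0.75):
    """When no reliable PM fits alone: take the average capacity C_avg of the
    reliable PMs and pick the unreliable PM p whose capacity C_p makes
    C_avg + C_p the smallest value still covering the demand."""
    reliable = [p["capacity"] for p in pms if p["reliability"] >= rel_threshold]
    if not reliable:
        return None
    c_avg = sum(reliable) / len(reliable)
    candidates = [p for p in pms
                  if p["reliability"] < rel_threshold and c_avg + p["capacity"] >= demand]
    return min(candidates, key=lambda p: c_avg + p["capacity"], default=None)
```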
Energy Efficiency
Energy consumption concern: an average datacenter consumes as much energy as 25,000 households (1). The main part of the energy consumption is determined by the CPU (2), and energy consumption dominates the operational costs.
(1) Kaplan, J., Forrest, W., Kindler, N. "Revolutionizing Data Center Energy Efficiency." McKinsey & Company, Tech. Rep.
(2) Berl, Andreas, et al. "Energy-efficient cloud computing." The Computer Journal 53.7 (2010): 1045-1051.
Energy Efficiency
Consolidation - minimize the number of active nodes, powering down inactive ones.
Dynamic Voltage and Frequency Scaling (DVFS) - modern CPUs can run at different clock frequencies.

Energy Efficiency - Examples
Entropy system: minimizes the number of active nodes, powering down inactive ones, while maintaining performance. It finds a configuration using the minimum number n of nodes necessary to host all VMs; constraint programming allows Entropy to find mappings of tasks to nodes, applied through a reconfiguration loop.
Hermenier, F., et al. "Entropy: a consolidation manager for clusters." Proc. of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. ACM, 2009.
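The consolidation objective can be illustrated with a simple bin-packing heuristic. Note this is not Entropy's actual algorithm (Entropy uses constraint programming to find provably minimal configurations); first-fit decreasing is only a sketch of the minimize-active-nodes idea.

```python
# Consolidation sketch: pack VM loads onto as few nodes as possible using
# first-fit decreasing. A new node is "powered up" only when the VM fits on
# no active node. Loads and capacity are in the same (normalized) unit.

def consolidate(vm_loads, node_capacity):
    """Return a list of nodes, each a list of the VM loads placed on it."""
    nodes = []
    for load in sorted(vm_loads, reverse=True):  # biggest VMs first
        for node in nodes:
            if sum(node) + load <= node_capacity:
                node.append(load)
                break
        else:
            nodes.append([load])  # no active node fits: power up a new one
    return nodes
```

With loads 0.6, 0.5, 0.5, 0.4 and unit-capacity nodes, the heuristic packs everything onto two nodes instead of four.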
Energy Efficiency - Examples
Entropy system results: reduces the consumption of cluster nodes per hour by over 50% as compared to static allocation.
(Figures: number of used physical machines; total execution time.)

DVFS-enabled clusters: the algorithm minimizes processor power dissipation by dynamically scaling down processor frequencies:
1) Minimize the processor supply voltage by scaling down the processor frequency.
2) Schedule VMs to PEs with low voltages and try not to scale PEs to high voltages.
von Laszewski, G., et al. "Power-aware scheduling of virtual machines in DVFS-enabled clusters." Cluster Computing and Workshops, CLUSTER'09. IEEE, 2009.
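The frequency-scaling step can be sketched as picking the lowest operating point that still meets the deadline. The operating points and the first-order dynamic-power model P ≈ C·V²·f below are illustrative assumptions, not values from the cited paper or any specific CPU.

```python
# DVFS sketch: choose the lowest (frequency, voltage) pair that finishes the
# VM's remaining work before its deadline. Dynamic power is approximated by
# the common first-order model P ~ C * V^2 * f. Operating points are made up.

OPERATING_POINTS = [  # (frequency in GHz, voltage in V), ascending
    (1.0, 0.9), (1.6, 1.0), (2.2, 1.2), (2.8, 1.35),
]

def dynamic_power(freq_ghz, volt, c=10.0):
    """First-order CMOS dynamic power estimate (arbitrary units)."""
    return c * volt ** 2 * freq_ghz

def choose_point(remaining_gcycles, deadline_s):
    """Lowest operating point whose frequency completes the work in time."""
    for freq, volt in OPERATING_POINTS:
        if remaining_gcycles / freq <= deadline_s:
            return freq, volt
    return OPERATING_POINTS[-1]  # deadline infeasible: run flat out
```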
Energy Efficiency
DVFS-enabled cluster results: applying the DVFS technique to the compute nodes (PEs) reduces overall power consumption without degrading the VMs' performance to unacceptable levels.
(Figures: performance impact of varying the number of VMs and the operating frequency; DVFS-enabled cluster scheduling simulation results.)
Quality of Service - Examples
Enforcing SLAs in scientific clouds: deadline-driven batch jobs under a Service Level Agreement (SLA). The system 1) tests the feasibility of the SLA and 2) if accepted, guarantees its fulfillment. The approach, built on a fuzzy control system, is independent of the underlying cloud infrastructure and should deal with performance fluctuations.
Niehorster, O., et al. "Enforcing SLAs in scientific clouds." Cluster Computing (CLUSTER), 2010 IEEE International Conference on. IEEE, 2010.

Agents autonomously prove the feasibility of the SLA and guarantee its fulfillment, meeting the deadline. The agents successfully deal with noise in the cloud that occurs when VMs are co-located: interference due to resource sharing (RAM, I/O, CPU).
Quality of Service - Examples
Sandpiper system: a hotspot detection algorithm determines when to resize or migrate VMs; a hotspot mitigation algorithm determines what and where to migrate and how many resources to allocate. VMs are migrated in decreasing order of VSR, the volume-to-size ratio (size = RAM footprint; volume = load).
(Figure: the Sandpiper architecture.)
Wood, T., et al. "Sandpiper: Black-box and gray-box resource management for virtual machines." Computer Networks 53.17 (2009): 2923-2938.

Sandpiper results: Sandpiper can resize the resources allocated to VMs; migrations occur if additional resources are not available, and a series of migrations resolves hotspots.
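The VSR ordering above can be sketched as follows. The volume formula used here, 1/((1-cpu)(1-net)(1-mem)), follows the definition given by Wood et al.; treat the exact form as an assumption if in doubt. The intuition is that sorting by VSR moves the most load per byte of VM state copied.

```python
# Sketch of Sandpiper's migration ordering: compute each VM's volume from its
# cpu, network and memory utilisations, divide by its RAM footprint (size) to
# get VSR, and migrate in decreasing VSR order.

def volume(cpu, net, mem):
    """Utilisations are in [0, 1); volume blows up as any resource saturates."""
    return 1.0 / ((1 - cpu) * (1 - net) * (1 - mem))

def migration_order(vms):
    """Sort VMs by decreasing volume-to-size ratio (VSR)."""
    vsr = lambda vm: volume(vm["cpu"], vm["net"], vm["mem"]) / vm["ram_mb"]
    return sorted(vms, key=vsr, reverse=True)
```

With equal loads, the VM with the smaller RAM footprint is migrated first, since it frees the same load at a lower migration cost.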
Approach
The goal: construct power- and failure-aware computing environments, in order to maximize the rate of jobs completed by their deadlines.
(Figure: spectrum from pure performance to higher service-level performance.)
Approach
It is an SLA-based approach, but the SLA agreement should consider user compensation if the deadline is missed. Virtual-to-physical resource mapping decisions consider both the power efficiency and the reliability level of the compute nodes, with dynamic updates of the virtual-to-physical configurations (CPU usage and migration).

Leverage virtualization tools:
Xen credit scheduler - dynamically update the cap parameter (CPU power consumption increases with CPU%, from 0 to 100).
Stop & copy migration - faster migrations, preferable for proactive failure management.
(Figure: timeline of VM placement on PM1-PM3 with a failure, stop & copy migration, and failure prediction accuracy.)
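Computing the cap value to set can be sketched as below. This is a hypothetical helper, not part of the approach's actual code: it derives the fraction of a core a VM needs to finish its remaining work by the deadline, which would then be applied through Xen's credit-scheduler interface.

```python
# Sketch: compute the Xen credit-scheduler cap (0-100, where 100 = one full
# CPU) needed for a VM to finish `remaining_mflop` of work by its deadline on
# a core rated at 800 MFLOPS (the rating used in the simulation setup below).
# Applying the value would use Xen's scheduler interface; here we only
# compute it.

def required_cap(remaining_mflop, deadline_s, core_mflops=800.0):
    """Smallest cap that meets the deadline, clamped to [1, 100]."""
    if deadline_s <= 0:
        return 100  # deadline already passed: give the VM everything
    needed_fraction = remaining_mflop / (core_mflops * deadline_s)
    return min(100, max(1, round(100 * needed_fraction)))
```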
System Overview
Cloud architecture: a private cloud of homogeneous PMs; a cluster coordinator manages user jobs; VMs are created and destroyed dynamically.
(Figure: private cloud management architecture.)
User jobs: a job is a set of independent tasks. Each task runs in a single VM, whose CPU-intensive workload is known. The number of tasks per job and the task deadlines are defined by the users.

Power model and capacity-reliability model.
(Figure: example of a power-efficiency curve, p1 = 175 W, p2 = 75 W.)
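A minimal sketch consistent with the curve parameters on this slide (p1 = 175 W at full load, p2 = 75 W idle). A linear interpolation in CPU utilisation is assumed here; the exact power model used in the talk may differ.

```python
# Power-model sketch: a PM draws P_IDLE watts at 0% CPU and P_FULL watts at
# 100%, linearly in between. Power efficiency (useful MFLOPS per watt) then
# rises with utilisation, because the idle power is amortised over more work,
# which is what motivates consolidation.

P_FULL, P_IDLE = 175.0, 75.0  # p1, p2 from the slide

def power(util):
    """Power draw (W) of a PM at CPU utilisation util in [0, 1]."""
    return P_IDLE + (P_FULL - P_IDLE) * util

def power_efficiency(util, core_mflops=800.0):
    """Useful work per watt (MFLOPS/W) at a given utilisation."""
    return (core_mflops * util) / power(util)
```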
Performance Analysis
Minimum Time Task Execution (MTTE) algorithm: given the slack time to accomplish task t and the capacity constraints of PM_i, it selects the PM_i that guarantees the minimum processing power required by the VM, increases power efficiency, and has higher reliability - but it reserves the maximum processing power.

Relaxed Time Task Execution (RTTE) algorithm: unlike MTTE, RTTE always reserves the minimum amount of processing power necessary to accomplish the task within its deadline (the host CPU cap, 0-100%, set in the Xen credit scheduler). However, RTTE is work-conserving.
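The contrast between the two policies can be sketched as below. The PM-scoring rule (prefer higher reliability, then higher current utilisation, which amortises idle power) is an illustrative simplification of the combined power-efficiency/reliability criterion, not the exact MTTE/RTTE formulation.

```python
# Sketch contrasting the two reservation policies. Both pick a PM with enough
# spare capacity; MTTE then reserves all the PM's spare processing power for
# the task, while RTTE reserves only the minimum needed to meet the deadline
# (the cap later set in the Xen credit scheduler). Scoring is illustrative.

def select_pm(pms, min_power):
    """Prefer reliable PMs, then PMs whose utilisation is already high
    (better power efficiency), subject to min_power spare capacity."""
    feasible = [p for p in pms if p["spare"] >= min_power]
    if not feasible:
        return None
    return max(feasible, key=lambda p: (p["reliability"], p["utilisation"]))

def reservation(pm, min_power, policy):
    """MTTE reserves everything spare; RTTE reserves only the minimum."""
    return pm["spare"] if policy == "MTTE" else min_power
```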
Performance Analysis
Implementation considerations: stabilization to avoid multiple migrations.
Algorithms compared to ours:
Common Best-Fit (CBFIT) - selects the PM with the maximum power efficiency and does not consider resource reliability.
Optimistic Best-Fit (OBFIT).
Pessimistic Best-Fit (PBFIT).

Simulation setup:
50 PMs, each modeled with one CPU core with performance equivalent to 800 MFLOPS.
VMs require 128 MB to 1024 MB of RAM; the stop & copy migration overhead depends on the RAM size.
100 synthetic jobs, each composed on average of 10 CPU-intensive tasks.
Failure instants follow a Weibull distribution with shape parameter 0.8 and MTBF = 200 minutes.
Failed PMs stay unavailable for a period modeled by a Lognormal distribution, with mean time set to 20 minutes and varying up to 150 minutes.
Task deadlines are set to 10% more than their minimum execution time.
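The failure model in this setup can be sketched directly with the standard-library generators. The Weibull scale is derived from the stated MTBF and shape; the Lognormal sigma is an assumption (the slide only fixes the mean repair time at 20 minutes).

```python
# Sketch of the simulation's failure model: inter-failure times ~ Weibull
# (shape 0.8, mean = MTBF = 200 min); repair durations ~ Lognormal with mean
# 20 min. Weibull mean = scale * Gamma(1 + 1/shape), so we solve for scale;
# Lognormal mean = exp(mu + sigma^2/2), so mu is derived from an assumed sigma.
import math
import random

SHAPE = 0.8
MTBF_MIN = 200.0
SCALE = MTBF_MIN / math.gamma(1 + 1 / SHAPE)

def failure_trace(horizon_min, rng):
    """Return (failure_time, repair_duration) pairs within the horizon."""
    events, t = [], 0.0
    while True:
        t += rng.weibullvariate(SCALE, SHAPE)
        if t >= horizon_min:
            return events
        sigma = 1.0                               # assumed spread
        mu = math.log(20.0) - sigma ** 2 / 2      # gives mean repair ~20 min
        events.append((t, rng.lognormvariate(mu, sigma)))
```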
Performance Analysis
Metrics:
Completion rate of user jobs.
Working-efficiency - measures the quantity of useful work done (i.e. completed user jobs) per unit of consumed power.
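The two metrics can be sketched as simple ratios. The exact normalisation used in the evaluation may differ; here useful work is taken as the MFLOP of completed jobs, an assumption consistent with the MFLOPS-rated PMs in the setup.

```python
# Metric sketch: completion rate is the fraction of user jobs finished by
# their deadlines; working-efficiency is the useful work of completed jobs
# divided by the total energy consumed (MFLOP per joule).

def completion_rate(jobs):
    """jobs: list of dicts with a boolean 'completed' flag."""
    return sum(j["completed"] for j in jobs) / len(jobs)

def working_efficiency(jobs, energy_joules):
    """Useful work (MFLOP of completed jobs) per joule consumed."""
    useful = sum(j["mflop"] for j in jobs if j["completed"])
    return useful / energy_joules
```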
Performance Analysis
Google Cloud tracelogs:
The median length of a job is 3 minutes, and the majority of jobs run in less than 15 minutes, although a number of jobs run longer than 300 minutes.
Task lengths follow a Lognormal distribution.
CPU usage, varying from near 0% to around 25%, follows a Lognormal distribution.
3614 synthetic jobs, for a total of 10357 tasks.
MTBF = 200 minutes.
Migrations occur due to proactive failure management only.
Energy Efficiency Improvement
The goal: a mechanism to detect energy-optimization opportunities while maintaining fault tolerance in the computing environment; find near-optimal values to correctly tune the condition-detection mechanism; dynamically update the virtual-to-physical configurations (CPU usage and migration).
(Figure: timeline of VM placement on PM1-PM3 with a failure, stop & copy migration, and failure prediction accuracy.)

Consolidation results (figures: without consolidation vs. with consolidation).
Conclusions
Cloud computing opens new challenges:
Energy efficiency (more MFLOP/Joule).
Dynamic load balancing.
VM interference modeling due to resource sharing (CPU, cache, I/O).
CPU-intensive and data-intensive jobs; data locality.
Scalability (distributed control).
Autonomic Computing.
CERN Cloud infrastructure: an MSc dissertation (MIEIC) to study and develop a resource management algorithm for the CERN cloud.