PerfCompass: Toward Runtime Performance Anomaly Fault Localization for Infrastructure-as-a-Service Clouds
Daniel J. Dean, Hiep Nguyen, Peipei Wang, Xiaohui Gu
Department of Computer Science, North Carolina State University

Abstract

Infrastructure-as-a-service (IaaS) clouds are becoming widely adopted. However, as multiple tenants share the same physical resources, performance anomalies have become one of the top concerns for users. Unfortunately, performance anomaly diagnosis in a production IaaS cloud often takes a long time due to its inherent complexity and sharing nature. In this paper, we present PerfCompass, a runtime performance anomaly fault localization tool using online system call trace analysis techniques. Specifically, PerfCompass tackles a challenging fault localization problem for IaaS clouds: differentiating whether a production-run performance anomaly is caused by an external fault (e.g., interference from other co-located applications) or an internal fault (e.g., a software bug). PerfCompass does not require any application source code or runtime instrumentation, which makes it practical for production IaaS clouds. We have tested PerfCompass using a set of popular software systems (e.g., Apache, MySQL, Squid, Cassandra, Hadoop) and a range of common cloud environment issues and real software bugs. The results show that PerfCompass accurately diagnoses all the faults while imposing low overhead during normal application execution.

1 Introduction

Infrastructure-as-a-service (IaaS) clouds [4] have become increasingly popular by providing a cost-effective resource sharing platform. While multi-tenant shared hosting has many benefits, there is also greater risk of performance anomalies (e.g., service outages [5], service level objective violations) arising due to its inherent complexity and sharing nature.
When an application in the IaaS cloud does not perform as expected, it is important to localize the cause of the performance anomaly quickly in order to minimize the performance penalty. However, diagnosing performance anomalies in a production IaaS cloud is extremely challenging. In a shared computing environment, a performance anomaly can be caused by either external faults (e.g., interference from other co-located applications, improper resource allocations) or internal faults (e.g., application software bugs). Differentiating between these two types of faults is critical because the steps taken to handle external and internal faults are quite different. In particular, since many external problems can be quickly fixed using virtual machine (VM) migration [13] or resource scaling [28, 21], a correct diagnosis avoids wasting time on unnecessary application debugging.

1.1 Summary of the State of the Art

Existing runtime performance anomaly fault localization tools (e.g., [31, 25, 23, 17, 12, 11, 3, 9, 18, 24, 22]) can only provide coarse-grained fault localization, such as identifying faulty components in a distributed application. In contrast, white-box or gray-box schemes (e.g., [19, 10, 30, 16]) can provide fine-grained diagnosis, such as localizing the faulty functions or code blocks. However, those approaches often require application source code and expensive instrumentation, which makes them impractical for production IaaS clouds. Although previous work [15, 27, 26] also developed trace-based performance debugging tools, most of them need an extensive profiling phase to extract application performance models in advance, which is often infeasible in IaaS clouds. Other performance debugging tools [16] need to combine user-space and kernel-space tracing, which can impose high overhead (up to 280%) on the application.
Moreover, the existing performance anomaly diagnosis solutions do not address the unique fault localization problems in IaaS clouds such as distinguishing between external and internal faults.
1.2 Our Contributions

In this paper, we present PerfCompass, a light-weight runtime performance anomaly fault localization tool designed to be used by either IaaS cloud administrators or cloud service users. PerfCompass does not require application source code or any instrumentation, making it practical for production IaaS clouds. PerfCompass uses a kernel-level tracing tool [14] to collect system call traces with low overhead (1.03% on average) and performs online analysis over the system call trace to achieve fine-grained fault localization. Thus, PerfCompass does not require any advance application profiling, allowing it to diagnose both previously seen and unseen anomalies. By using kernel-level system call traces only, our approach is both light-weight and generic, applicable to any multi-processed or multi-threaded application written in different programming languages (e.g., both compiled and interpreted programs). In particular, PerfCompass focuses on diagnosing whether a performance anomaly is caused by an internal or external fault. This diagnosis result is critical for achieving efficient performance anomaly correction in production IaaS clouds.

System call traces often have high volume. For example, an Apache web server can produce tens of thousands of system calls per second. Performing fault localization over such a large number of system calls is like finding a needle in a haystack. We provide online system call trace analysis algorithms that can quickly extract useful fault features for identifying whether the fault is external or internal. Specifically, we extract a fault impact factor feature and a fault onset time dispersion feature to infer whether the performance anomaly is caused by an external or internal fault. When a performance anomaly occurs, the PerfCompass analysis module is dynamically triggered to perform online fault localization within the faulty VMs.
We can identify faulty VMs using existing online black-box fault localization tools [22, 3, 9, 18, 24]. However, to differentiate between external and internal faults, we cannot treat the whole application VM as one black box. PerfCompass first extracts groups of closely related system calls, called execution units, from the continuous raw system call traces. System calls can first be grouped easily by process/thread ID. However, we observe that some server systems (e.g., the Apache web server) use a pre-allocated thread pool and often reuse the same thread to perform different tasks. We therefore further split per-thread execution units based on the inter-system-call time gap (e.g., the 99th percentile value) to mitigate inaccurate system call grouping caused by thread reuse. After extracting the execution units, we perform change detection over moving averages of system call execution time and system call frequency to detect whether an execution unit is affected by the fault. We then derive a fault onset time for each affected execution unit to quantify how quickly the fault impacts it. Our key observation is that an external fault typically affects all running execution units directly and simultaneously. In contrast, an internal fault directly affects only a subset of execution units at the beginning, although the impact might propagate to other execution units later through communication or shared resources. Thus, we use the impact factor to quantify the scope of the fault's direct impact on different execution units and the fault onset time dispersion to quantify the onset time difference among the affected execution units.
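The unit-extraction step described above can be sketched as follows. This is a minimal illustration, not the PerfCompass implementation: the trace format (a time-sorted list of (timestamp, thread ID, system call name) tuples) and the helper names are assumptions.

```python
def percentile(values, p):
    """Linear-interpolation percentile of a list of numbers."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

def extract_execution_units(trace, gap_percentile=99):
    """Group raw system call events into execution units.

    trace: time-sorted list of (timestamp_sec, thread_id, syscall_name)
    tuples. Events are first grouped by thread ID; each per-thread stream
    is then split wherever the gap between consecutive system calls
    exceeds that thread's 99th-percentile inter-call gap, to mitigate
    mis-grouping caused by thread reuse in pre-allocated thread pools.
    """
    by_tid = {}
    for ts, tid, name in trace:
        by_tid.setdefault(tid, []).append((ts, name))

    units = []  # each unit: (thread_id, [(timestamp, syscall_name), ...])
    for tid, events in by_tid.items():
        if len(events) < 2:
            units.append((tid, events))
            continue
        gaps = [b[0] - a[0] for a, b in zip(events, events[1:])]
        threshold = percentile(gaps, gap_percentile)
        current = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > threshold:  # idle gap: start a new unit
                units.append((tid, current))
                current = []
            current.append(cur)
        units.append((tid, current))
    return units
```

Under this sketch, a long-lived worker thread that serves one request, idles, and then serves another is split into two execution units at the idle gap.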
We can then infer whether the performance anomaly is caused by an external or internal fault by checking whether the fault has a global or local direct impact and whether the fault onset times of different execution units are similar.

In this paper, we make the following contributions:

- We make a first step toward runtime fine-grained performance anomaly fault localization in IaaS clouds using low-overhead kernel-level system call tracing and analysis techniques.
- We describe online system call trace analysis algorithms that can quickly extract useful fault features (e.g., fault impact scope, fault onset time dispersion) from massive raw system call traces.
- We have implemented the PerfCompass system and tested it using five open source server systems under common environment issues and real software bugs. The results show that PerfCompass successfully diagnoses all the tested faults while imposing an average of 1.03% runtime overhead.

2 System Design

In this section, we present the design details of the PerfCompass system.

2.1 Fault Onset Time Identification

In order to distinguish between external and internal faults, we first extract two pieces of important information from the raw system call traces of each execution unit: 1) which execution units are affected by the fault, and 2) when does the fault impact first start in each execution unit? To detect whether an execution unit is affected by the fault, we first analyze the system call execution time.
Our observation is that if the fault causes an application performance slowdown, the system call execution time should also increase. We therefore compute the per-system-call execution time and detect whether the fault affects the execution unit by identifying outlier system call execution times. We perform outlier detection using the standard criterion (i.e., > mean + 2 × standard deviation). We also observe that different system calls often have varied execution times depending on their function. Thus, it is necessary to distinguish different kinds of system calls (e.g., sys_read, sys_write) and compute the per-system-call execution time for each kind of system call separately. If any system call type is identified as an outlier, we infer that this execution unit has been affected by the fault.

Some performance anomalies do not manifest as changes in system call execution time. For example, if the performance anomaly is caused by an incorrect loop bug, we might observe the system calls inside the loop iterating more times when the bug is triggered than during normal execution. We therefore also maintain a frequency count for each system call type. An execution unit is said to be affected by the fault if we detect anomalous changes in either the system call execution time or the frequency of any type of system call.

If an execution unit is affected by the fault, we use a fault onset time metric to quantify how fast the fault affects the execution unit. We calculate the fault onset time as the interval between the start time of the execution unit and the timestamp of the first affected system call in that unit. This interval represents how soon after the execution unit is scheduled to run it starts to be affected by the currently occurring fault.

2.2 Fault Localization Schemes

Our fault localization schemes are based on extracting and analyzing two fault features.
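Before the two features can be computed, each execution unit must be flagged as affected or not and given an onset time. A minimal sketch of the per-type outlier test described above (the event layout and baseline format are assumptions, and the frequency-based check is omitted for brevity):

```python
def fault_onset_time(unit_events, baseline):
    """Return the fault onset time of an execution unit, or None if the
    unit is unaffected.

    unit_events: time-ordered [(timestamp_sec, syscall_name, duration_sec)]
    tuples for one execution unit. baseline: {syscall_name: (mean, stdev)}
    of per-call execution time for each system call type. A call is an
    outlier when its duration exceeds mean + 2 * stdev for its own type;
    the onset time is the interval between the unit's start and the first
    outlier call.
    """
    start = unit_events[0][0]
    for ts, name, duration in unit_events:
        mean, stdev = baseline.get(name, (float("inf"), 0.0))
        if duration > mean + 2 * stdev:
            return ts - start
    return None  # no outlier of any system call type: unit not affected
```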
We first extract the fault impact feature to infer whether the fault has a global or local impact. We define a fault impact factor metric as the percentage of threads that are directly affected by the fault. If a thread consists of multiple execution units, we only consider the fault onset time of the first execution unit affected by the fault, since we want to identify when the fault first affected each thread. If the fault onset time of an affected thread is smaller than a pre-defined threshold (e.g., 1 second), we say this thread is affected by the fault directly; we discuss how to set this threshold later in this section. If the fault impact factor is close to 100%, we infer that the fault is external; if the fault impact factor is significantly less than 100%, we infer that the fault is internal. However, if the fault impact factor has a borderline value (e.g., 90%), we extract a fault onset time dispersion feature to identify whether the fault affects different threads at the same time or at different times. Since an external fault affects all the running threads uniformly, the affected threads are likely to show the fault impact at a similar time after the fault is triggered. We use the standard deviation of the fault onset time among all the affected threads to quantify the fault onset time dispersion. If the fault onset time dispersion is small, we infer that the fault is external. In contrast, an internal fault is likely to directly affect a subset of threads executing the buggy code and then indirectly affect other threads that communicate with the directly affected threads. Thus, if we observe a large fault onset time dispersion, we infer that the fault is internal. In our experiments, we found that a fixed fault onset time threshold (1 second) works well for all the applications and faults we tested.
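The two-feature decision rule above can be sketched as follows. The specific cutoff values and the tie-breaking order are illustrative assumptions; only the 1-second direct-impact threshold comes from the text.

```python
import statistics

def classify_fault(thread_onset_times,
                   direct_threshold=1.0,      # seconds; fixed default from the text
                   impact_cutoff=0.95,        # assumed "close to 100%" cutoff
                   dispersion_cutoff=0.05):   # seconds; assumed calibrated value
    """Infer external vs. internal fault from per-thread fault onset times.

    thread_onset_times: {thread_id: onset_time_sec} for every affected
    thread (for a thread with several execution units, the onset time of
    its first affected unit). A thread is directly affected when its
    onset time is below direct_threshold.
    """
    onsets = list(thread_onset_times.values())
    direct = [t for t in onsets if t < direct_threshold]
    impact_factor = len(direct) / len(onsets)

    if impact_factor >= impact_cutoff:
        # Borderline-high impact factors are confirmed with the
        # fault onset time dispersion feature.
        dispersion = statistics.pstdev(direct) if len(direct) > 1 else 0.0
        if dispersion <= dispersion_cutoff:
            return "external"
    return "internal"
```

In this sketch, an external fault such as co-located I/O interference hits every thread within milliseconds of each other (high impact factor, small dispersion), while an internal bug first hits only the threads executing the buggy code.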
Generally speaking, it is more accurate to apply an application-specific threshold. We observe that the fault onset times of different external faults are similar for the same application. Thus, we can easily calibrate the fault onset time threshold for each application by imposing a simple external fault on the application (e.g., imposing a low CPU cap by limiting the CPU consumption of the application). Such a calibration can be done without requiring any specific workload or fault type. We can calibrate the fault onset time dispersion threshold in a similar way.

3 Evaluation

We evaluate PerfCompass using real system performance anomalies caused by different external and internal faults. We first describe our experiment setup, followed by a summary of our results.

3.1 Experiment Setup

We tested PerfCompass with five different systems: Apache [6], MySQL [20], Squid [29], Cassandra [7], and Hadoop [8]. Table 1 lists all the faults we tested. Each of the 9 external faults represents a common multi-tenancy or environment issue such as interference from other co-located applications, insufficient resource allocation, or network packet loss. We also tested 7 internal faults, which are real software bugs found by searching the appropriate bug reporting repository (e.g., Bugzilla) for performance-related terms. We then followed the instructions given in each bug report to reproduce the bug. We use Apache, MySQL, Squid 3.2.9, Cassandra beta, and Hadoop alpha for the
Apache, MySQL, Squid, Cassandra, and Hadoop external faults, respectively. For each internal fault, we used the version specified in the bug report. In order to evaluate PerfCompass under workloads with realistic time variations, we used the following workloads during our experiments: 1) Apache: one day of per-minute workload intensity observed in a NASA web server trace starting at :00.00 [2]; 2) MySQL: a MySQL benchmark tool called Sysbench [1]; 3) Squid: we configure Squid to act as a web proxy and use httperf to request various pages from an Apache web server through it; 4) Cassandra: a workload which creates a table and inserts various entries into it; and 5) Hadoop: the standard Pi calculation example with 16 map and 16 reduce tasks. We repeated each fault injection three times, reporting the average and standard deviation over all three runs.

Table 1: PerfCompass fault localization result summary for different systems under various external and internal faults.

Apache:
- Packet loss problem (external): using the tc command to randomly drop 10% of packets.
- Flag setting bug (internal): deleting a port Apache is listening on and then restarting the server causes Apache to attempt a blocking call on a socket when a flag preventing blocking calls has not been cleared. The code does not check for this condition and retries endlessly (#37680).

MySQL:
- I/O interference problem (external): a co-located VM causes disk contention by running a disk-intensive Hadoop job.
- Deadlock bug (internal): a MySQL deadlock that occurs when each of two connections locks one table and tries to lock the other. If one connection tries to execute an INSERT DELAYED command on the other's table while the other is sleeping, the system deadlocks (#54332).
- Data flushing bug (internal): truncating a table causes a 5x slowdown in table insertions due to a bug in the InnoDB storage engine for big datasets. InnoDB fails to mark truncated data as deleted and constantly allocates new blocks (#65615).
- Packet loss problem (external): using the tc command to randomly drop 10% of packets.

Squid:
- File access bug (internal): if /dev/null is made inaccessible, for example by changing its permissions, Squid loops endlessly trying to open it (#1484).

Cassandra:
- I/O interference problem (external): a co-located VM causes disk contention by running a disk-intensive Hadoop job.
- Endless loop bug (internal): trying to alter a table that includes collections causes it to hang, consuming 100% CPU, due to an internal problem with the way Cassandra locks tables for updating (#5064).

Hadoop:
- I/O interference problem (external): a co-located VM causes disk contention by running a disk-intensive Hadoop job.
- Endless read bug (internal): HDFS does not check for an overflow of an internal content length field, causing HDFS transfers larger than 2GB to block and continuously try to read from an input stream (#HDFS-3318).
- Thread shutdown bug (internal): when the AppLogAggregator thread dies unexpectedly (e.g., due to a crash), the task waits for an atomic variable to be set indicating thread shutdown is complete. As the thread has already died, the variable is never set and the job hangs indefinitely (#MAPREDUCE-3738).

Fault impact factor / fault onset time dispersion columns, in table row order: 100 ± 0 % / 7 ± 1 ms; 100 ± 0 % / 4 ± 1 ms; 50 ± 0.5 % / 374 ± 63 ms; 100 ± 0 % / 15 ± 7 ms; 94 ± 2 % / ± 4 ms; 40 ± 0 % / 38 ± 3 ms; 62 ± 3 % / 721 ± 4 ms; 100 ± 0 % / 0.01 ± ms; 83 ± 1 % / 0.35 ± 0.09 ms; 99 ± 1.4 % / 28 ± 4 ms; 100 ± 0 % / 9 ± 1 ms; 51 ± 5.7 % / 25 ± 0.98 ms; 98 ± 1 % / 39 ± 5 ms; 98 ± 0 % / 16 ± 3 ms; 81 ± 0 % / 23 ± 6 ms; 85 ± 0.5 % / 110 ± 20 ms. All faults were correctly diagnosed.
To evaluate our approach in different virtualization environments, we conducted our experiments on two different virtualized clusters. The Apache and MySQL experiments were conducted on a cluster where each host has a dual-core Xeon 3.0GHz CPU and 4GB of memory, and runs 64-bit CentOS 5.2 with Xen. The Squid, Cassandra, and Hadoop experiments were conducted on a cluster where each node is equipped with a quad-core Xeon 2.53GHz CPU and 8GB of memory, with KVM 1.0. In both cases, each trace was collected on a guest VM running a 32-bit Ubuntu kernel, using LTTng.
We calibrate the fault onset time threshold and fault onset time dispersion threshold values by injecting a low CPU cap external fault into the different applications. Note that we do not need to perform the calibration under the same workload intensity as at the time the performance anomaly occurs. The fault onset time thresholds we obtain are 0.06 second for Apache, 0.3 second for MySQL, second for Squid, and 0.4 second for Cassandra and Hadoop. The fault onset time dispersion thresholds we obtain are 7ms for Apache, 17ms for MySQL, 0.02ms for Squid, 28ms for Cassandra, and 39ms for Hadoop. We also conducted the experiments using a fixed fault onset time threshold (1 second) and found that this simple threshold gives similar results.

3.2 Result Summary

Table 1 provides a summary of our fault localization results for the different systems under various external and internal faults. The results show that PerfCompass correctly diagnoses each of the 9 external faults as external and each of the 7 internal faults as internal. All the tested external faults have impact factors close to 100%, clearly indicating that each of them is external. Similarly, most internal faults have an impact factor significantly less than 100%, clearly indicating that each of them is internal. We also observe that most internal faults have significantly larger fault onset time dispersion values than the external faults of the same application.

3.3 PerfCompass Overhead

We now evaluate the overhead of the PerfCompass system. Figure 1 shows the runtime overhead imposed by PerfCompass on each of the tested systems. For Apache, MySQL, and Squid we used httperf to send a fixed number of requests, recording the average response time both with and without PerfCompass. We used a request rate of 100 requests per second for Apache and Squid, and a request rate of 20 requests per second for MySQL.¹ For Cassandra, we ran a simple database insertion workload, recording the average processing time.
For Hadoop, we ran the Pi sample job, recording the average processing time. We ran all the tests 5 times, reporting the mean and standard deviation. We observe that PerfCompass imparts 1.03% runtime overhead on average. We also measured the resource consumption of PerfCompass. We found that PerfCompass imposes 2-3% CPU load and has a small memory footprint (about 256KB). We believe that PerfCompass is light-weight, which makes it practical for online performance anomaly fault localization in production IaaS clouds.

¹ Those request rates were selected based on the capacity of the tested machine.

[Figure 1: The overhead imposed by PerfCompass on different server systems.]

4 Conclusions and Future Work

In this paper, we have presented PerfCompass, a runtime performance anomaly fault localization tool for production IaaS clouds. PerfCompass can distinguish between external and internal faults without requiring source code or runtime instrumentation. PerfCompass uses light-weight kernel-level system call tracing and performs online system call trace analysis to extract fault features from massive raw system call traces at runtime. We have implemented PerfCompass and evaluated it using a variety of commonly used open source server systems including Apache, MySQL, Squid, Cassandra, and Hadoop. We tested PerfCompass using a set of common environment issues in shared IaaS clouds and real software bugs. The results show that PerfCompass accurately diagnoses all the tested faults. PerfCompass is light-weight and non-intrusive, which makes it practical for IaaS clouds. In the future, we plan to extend our system call analysis framework to extract other fault features to support different types of fine-grained runtime performance anomaly fault localization, such as narrowing down potential root cause functions.
Acknowledgment

We thank the anonymous reviewers for their valuable comments. This work was sponsored in part by NSF CNS grants, an NSF CAREER Award, the U.S. Army Research Office (ARO) under grant W911NF, IBM Faculty Awards, and Google Research Awards. Any opinions expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF, ARO, or the U.S. Government.
References

[1] SysBench: a system performance benchmark.
[2] The IRCache Project.
[3] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In SIGOPS.
[4] Amazon elastic compute cloud.
[5] Amazon elastic compute cloud service outage. http://money.cnn.com/2011/04/21/technology/amazon_server_outage/.
[6] Apache.
[7] Apache Cassandra.
[8] Apache Hadoop.
[9] P. Bahl, P. Barham, R. Black, R. Chandra, M. Goldszmidt, R. Isaacs, S. Kandula, L. Li, J. MacCormick, D. A. Maltz, R. Mortier, M. Wawrzoniak, and M. Zhang. Discovering dependencies for network management. In HotNets.
[10] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for request extraction and workload modelling. In OSDI.
[11] A. Chanda, A. L. Cox, and W. Zwaenepoel. Whodunit: Transactional profiling for multi-tier applications. In SIGOPS.
[12] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic internet services. In DSN.
[13] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In NSDI.
[14] M. Desnoyers and M. R. Dagenais. The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux. In Linux Symposium.
[15] X. Ding, H. Huang, Y. Ruan, A. Shaikh, and X. Zhang. Automatic software fault diagnosis by exploiting application signatures. In LISA.
[16] U. Erlingsson, M. Peinado, S. Peter, and M. Budiu. Fay: Extensible distributed tracing from kernels to clusters. In SOSP.
[17] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-Trace: A pervasive network tracing framework. In NSDI.
[18] A. Hussain, G. Bartlett, Y. Pryadkin, J. Heidemann, C. Papadopoulos, and J. Bannister. Experiences with a continuous network tracing infrastructure. In SIGCOMM workshop on mining network data.
[19] G. Jin, L. Song, X. Shi, J. Scherpelz, and S. Lu. Understanding and detecting real-world performance bugs. In PLDI.
[20] MySQL.
[21] H. Nguyen, Z. Shen, X. Gu, S. Subbiah, and J. Wilkes. AGILE: elastic distributed resource scaling for infrastructure-as-a-service. In ICAC.
[22] H. Nguyen, Z. Shen, Y. Tan, and X. Gu. FChain: Toward black-box online fault localization for cloud systems. In ICDCS.
[23] P. Reynolds, C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah, and A. Vahdat. Pip: Detecting the unexpected in distributed systems. In NSDI.
[24] P. Reynolds, J. L. Wiener, J. C. Mogul, M. K. Aguilera, and A. Vahdat. WAP5: black-box performance debugging for wide-area systems. In WWW.
[25] R. R. Sambasivan, A. X. Zheng, M. D. Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, and G. R. Ganger. Diagnosing performance changes by comparing request flows. In NSDI.
[26] A. B. Sharma, H. Chen, M. Ding, K. Yoshihira, and G. Jiang. Fault detection and localization in distributed systems using invariant relationships. In DSN.
[27] K. Shen, C. Stewart, C. Li, and X. Li. Reference-driven performance anomaly identification. In SIGMETRICS.
[28] Z. Shen, S. Subbiah, X. Gu, and J. Wilkes. CloudScale: elastic resource scaling for multi-tenant cloud systems. In SoCC.
[29] Squid-cache.
[30] A. Traeger, I. Deras, and E. Zadok. DARC: Dynamic analysis of root causes of latency distributions. In SIGMETRICS.
[31] M. Yu, A. Greenberg, D. Maltz, J. Rexford, L. Yuan, S. Kandula, and C. Kim. Profiling network performance for multi-tier data center applications. In NSDI.
Abstract. 1 Introduction. 2 Problem Statement. 1 Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
1 Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan Electrical & Computer Engineering Department Carnegie Mellon University,
Dynamic resource management for energy saving in the cloud computing environment
Dynamic resource management for energy saving in the cloud computing environment Liang-Teh Lee, Kang-Yuan Liu, and Hui-Yang Huang Department of Computer Science and Engineering, Tatung University, Taiwan
PipeCloud : Using Causality to Overcome Speed-of-Light Delays in Cloud-Based Disaster Recovery. Razvan Ghitulete Vrije Universiteit
PipeCloud : Using Causality to Overcome Speed-of-Light Delays in Cloud-Based Disaster Recovery Razvan Ghitulete Vrije Universiteit Introduction /introduction Ubiquity: the final frontier Internet needs
Ganesha: Black-Box Diagnosis of MapReduce Systems
Ganesha: Black-Box Diagnosis of MapReduce Systems Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, Priya Narasimhan Electrical & Computer Engineering Department, Carnegie Mellon University Pittsburgh,
Cloud Computing through Virtualization and HPC technologies
Cloud Computing through Virtualization and HPC technologies William Lu, Ph.D. 1 Agenda Cloud Computing & HPC A Case of HPC Implementation Application Performance in VM Summary 2 Cloud Computing & HPC HPC
Cloud Computing Workload Benchmark Report
Cloud Computing Workload Benchmark Report Workload Benchmark Testing Results Between ProfitBricks and Amazon EC2 AWS: Apache Benchmark, nginx Benchmark, SysBench, pgbench, Postmark October 2014 TABLE OF
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
Software Execution Protection in the Cloud
Software Execution Protection in the Cloud Miguel Correia 1st European Workshop on Dependable Cloud Computing Sibiu, Romania, May 8 th 2012 Motivation clouds fail 2 1 Motivation accidental arbitrary faults
USING VIRTUAL MACHINE REPLICATION FOR DYNAMIC CONFIGURATION OF MULTI-TIER INTERNET SERVICES
USING VIRTUAL MACHINE REPLICATION FOR DYNAMIC CONFIGURATION OF MULTI-TIER INTERNET SERVICES Carlos Oliveira, Vinicius Petrucci, Orlando Loques Universidade Federal Fluminense Niterói, Brazil ABSTRACT In
Marvell DragonFly Virtual Storage Accelerator Performance Benchmarks
PERFORMANCE BENCHMARKS PAPER Marvell DragonFly Virtual Storage Accelerator Performance Benchmarks Arvind Pruthi Senior Staff Manager Marvell April 2011 www.marvell.com Overview In today s virtualized data
Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)
UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter
Virtualization and Cloud Computing. The Threat of Covert Channels. Related Work. Zhenyu Wu, Zhang Xu, and Haining Wang 1
Virtualization and Cloud Computing Zhenyu Wu, Zhang Xu, Haining Wang William and Mary Now affiliated with NEC Laboratories America Inc. Server Virtualization Consolidates workload Simplifies resource management
Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)
Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
System Behavior Analysis by Machine Learning
CSC456 OS Survey Yuncheng Li [email protected] December 6, 2012 Table of contents 1 Motivation Background 2 3 4 Table of Contents Motivation Background 1 Motivation Background 2 3 4 Scenarios Motivation
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition
Liferay Portal Performance Benchmark Study of Liferay Portal Enterprise Edition Table of Contents Executive Summary... 3 Test Scenarios... 4 Benchmark Configuration and Methodology... 5 Environment Configuration...
Improving MapReduce Performance in Heterogeneous Environments
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce
Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
EWeb: Highly Scalable Client Transparent Fault Tolerant System for Cloud based Web Applications
ECE6102 Dependable Distribute Systems, Fall2010 EWeb: Highly Scalable Client Transparent Fault Tolerant System for Cloud based Web Applications Deepal Jayasinghe, Hyojun Kim, Mohammad M. Hossain, Ali Payani
This is an author-deposited version published in : http://oatao.univ-toulouse.fr/ Eprints ID : 12902
Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited
Variations in Performance and Scalability when Migrating n-tier Applications to Different Clouds
Variations in Performance and Scalability when Migrating n-tier Applications to Different Clouds Deepal Jayasinghe, Simon Malkowski, Qingyang Wang, Jack Li, Pengcheng Xiong, Calton Pu Outline Motivation
Group Based Load Balancing Algorithm in Cloud Computing Virtualization
Group Based Load Balancing Algorithm in Cloud Computing Virtualization Rishi Bhardwaj, 2 Sangeeta Mittal, Student, 2 Assistant Professor, Department of Computer Science, Jaypee Institute of Information
Virtual Technologies for Learning System. Chao-Wen Chan, Chih-Min Chen. National Taichung University of Science and Technology, Taiwan
Virtual Technologies for Learning System Chao-Wen Chan, Chih-Min Chen 0274 National Taichung University of Science and Technology, Taiwan The Asian Conference on Technology in the Classroom 2012 2012 Abstract:
Cloud Operating Systems for Servers
Cloud Operating Systems for Servers Mike Day Distinguished Engineer, Virtualization and Linux August 20, 2014 [email protected] 1 What Makes a Good Cloud Operating System?! Consumes Few Resources! Fast
Monitoring Elastic Cloud Services
Monitoring Elastic Cloud Services [email protected] Advanced School on Service Oriented Computing (SummerSoc 2014) 30 June 5 July, Hersonissos, Crete, Greece Presentation Outline Elasticity in Cloud
The public-cloud-computing market has
View from the Cloud Editor: George Pallis [email protected] Comparing Public- Cloud Providers Ang Li and Xiaowei Yang Duke University Srikanth Kandula and Ming Zhang Microsoft Research As cloud computing
COMMA: Coordinating the Migration of Multi-tier Applications
COMMA: Coordinating the Migration of Multi-tier Applications Jie Zheng T. S. Eugene Ng Kunwadee Sripanidkulchai Zhaolei Liu Rice University NECTEC, Thailand Abstract Multi-tier applications are widely
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE
PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE Sudha M 1, Harish G M 2, Nandan A 3, Usha J 4 1 Department of MCA, R V College of Engineering, Bangalore : 560059, India [email protected] 2 Department
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures
IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
Dynamic Load Balancing of Virtual Machines using QEMU-KVM
Dynamic Load Balancing of Virtual Machines using QEMU-KVM Akshay Chandak Krishnakant Jaju Technology, College of Engineering, Pune. Maharashtra, India. Akshay Kanfade Pushkar Lohiya Technology, College
BSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China [email protected],
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Virtual Switching Without a Hypervisor for a More Secure Cloud
ing Without a for a More Secure Cloud Xin Jin Princeton University Joint work with Eric Keller(UPenn) and Jennifer Rexford(Princeton) 1 Public Cloud Infrastructure Cloud providers offer computing resources
Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1
Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System
Migration of Virtual Machines for Better Performance in Cloud Computing Environment
Migration of Virtual Machines for Better Performance in Cloud Computing Environment J.Sreekanth 1, B.Santhosh Kumar 2 PG Scholar, Dept. of CSE, G Pulla Reddy Engineering College, Kurnool, Andhra Pradesh,
HPC performance applications on Virtual Clusters
Panagiotis Kritikakos EPCC, School of Physics & Astronomy, University of Edinburgh, Scotland - UK [email protected] 4 th IC-SCCE, Athens 7 th July 2010 This work investigates the performance of (Java)
Performance of the Cloud-Based Commodity Cluster. School of Computer Science and Engineering, International University, Hochiminh City 70000, Vietnam
Computer Technology and Application 4 (2013) 532-537 D DAVID PUBLISHING Performance of the Cloud-Based Commodity Cluster Van-Hau Pham, Duc-Cuong Nguyen and Tien-Dung Nguyen School of Computer Science and
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
Adaptive Probing: A Monitoring-Based Probing Approach for Fault Localization in Networks
Adaptive Probing: A Monitoring-Based Probing Approach for Fault Localization in Networks Akshay Kumar *, R. K. Ghosh, Maitreya Natu *Student author Indian Institute of Technology, Kanpur, India Tata Research
VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5
Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.
An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform
An Experimental Study of Load Balancing of OpenNebula Open-Source Cloud Computing Platform A B M Moniruzzaman 1, Kawser Wazed Nafi 2, Prof. Syed Akhter Hossain 1 and Prof. M. M. A. Hashem 1 Department
Infrastructure as a Service (IaaS)
Infrastructure as a Service (IaaS) (ENCS 691K Chapter 4) Roch Glitho, PhD Associate Professor and Canada Research Chair My URL - http://users.encs.concordia.ca/~glitho/ References 1. R. Moreno et al.,
STeP-IN SUMMIT 2013. June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)
10 th International Conference on Software Testing June 18 21, 2013 at Bangalore, INDIA by Sowmya Krishnan, Senior Software QA Engineer, Citrix Copyright: STeP-IN Forum and Quality Solutions for Information
Method of Fault Detection in Cloud Computing Systems
, pp.205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,
Inciting Cloud Virtual Machine Reallocation With Supervised Machine Learning and Time Series Forecasts. Eli M. Dow IBM Research, Yorktown NY
Inciting Cloud Virtual Machine Reallocation With Supervised Machine Learning and Time Series Forecasts Eli M. Dow IBM Research, Yorktown NY What is this talk about? This is a 30 minute technical talk about
benchmarking Amazon EC2 for high-performance scientific computing
Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received
High Availability for Database Systems in Cloud Computing Environments. Ashraf Aboulnaga University of Waterloo
High Availability for Database Systems in Cloud Computing Environments Ashraf Aboulnaga University of Waterloo Acknowledgments University of Waterloo Prof. Kenneth Salem Umar Farooq Minhas Rui Liu (post-doctoral
9/26/2011. What is Virtualization? What are the different types of virtualization.
CSE 501 Monday, September 26, 2011 Kevin Cleary [email protected] What is Virtualization? What are the different types of virtualization. Practical Uses Popular virtualization products Demo Question,
Modeling Virtual Machine Performance: Challenges and Approaches
Modeling Virtual Machine Performance: Challenges and Approaches Omesh Tickoo Ravi Iyer Ramesh Illikkal Don Newell Intel Corporation Intel Corporation Intel Corporation Intel Corporation [email protected]
Cloud Performance Benchmark Series
Cloud Performance Benchmark Series Amazon Elastic Load Balancing (ELB) Md. Borhan Uddin Bo He Radu Sion ver. 0.5b 1. Overview Experiments were performed to benchmark the Amazon Elastic Load Balancing (ELB)
Building a Private Cloud with Eucalyptus
Building a Private Cloud with Eucalyptus 5th IEEE International Conference on e-science Oxford December 9th 2009 Christian Baun, Marcel Kunze KIT The cooperation of Forschungszentrum Karlsruhe GmbH und
