Cloud Computing. and Scheduling. Data-Intensive Computing. Frederic Magoules, Jie Pan, and Fei Teng SILKQH. CRC Press. Taylor & Francis Group

Cloud Computing Data-Intensive Computing and Scheduling Frederic Magoules, Jie Pan, and Fei Teng SILKQH CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint of the Taylor & Francis Croup, an Informa business A CHAPMAN & HALL BOOK

Contents List of figures xiii List of tables xv Foreword xvii Preface xix Warranty xxv 1 Overview of cloud computing 1 1.1 Introduction 1 1.1.1 Cloud definitions 1 1.1.2 System architecture 2 1.1.3 Deployment models 3 1.1.4 Cloud characteristics 3 1.2 Cloud evolution 6 1.2.1 Getting ready for the cloud 6 1.2.2 Brief history 7 1.2.3 Comparison w ith related technologies 8 1.3 Cloud services 10 1.4 Cloud projects 12 1.4.1 Commercial products 12 1.4.2 Research projects 13 1.5 Cloud challenges 15 1.5.1 MapReduce programming model 15 1.5.2 Data management 16 1.5.3 Resource scheduling 16 1.6 Concluding remarks 17 2 Resource scheduling for cloud computing 19 2.1 Introduction 19 2.2 Cloud service scheduling hierarchy 19 2.3 Economic models for resource-allocation scheduling 21 2.3.1 Market strategies 21 2.3.2 Auction strategies 23 2.3.3 Economic schedulers 25 vii

viii 2.4 Heuristic models for task-execution scheduling 27 2.4.1 Static strategies 28 2.4.2 Dynamic strategies 30 2.4.3 Heuristic schedulers 32 2.5 Real-time scheduling in cloud computing 35 2.5.1 Fixed priority strategies 35 2.5.2 Dynamic priority strategies 37 2.5.3 Real-time schedulers 38 2.6 Concluding remarks 39 3 Game theoretical allocation in a cloud datacenter 41 3.1 Introduction 41 3.2 Game theory 42 3.2.1 Normal formulation 42 3.2.2 Payoff choice and utility function 43 3.2.3 Strategy choice and Nash equilibrium 45 3.3 Cloud resource allocation model 46 3.3.1 Bid-shared auction 46 3.3.2 Non-cooperative game 47 3.4 Nash equilibrium allocation algorithms 48 3.4.1 Bid functions 48 3.4.2 Parameters estimation 50 3.4.3 Equilibrium price 53 3.5 Implementation in a cloud datacenter 55 3.5.1 Cloudsim toolkit 55 3.5.2 Communication among entities 55 3.5.3 Bidding algorithms 57 3.5.4 Comparison of forecasting methods 57 3.6 Concluding remarks 61... 4 Multi-dimensional data analysis in a cloud datacenter 63 4.1 Introduction 63 4.2 Pre-computing 64 4.2.1 Data cube 65 4.2.2 Sparse cube 65 4.2.3 Reuse of previous query results 66 4.2.4 Data compressing 66 4.3 Data indexing 67 4.4 Data partitioning 68 4.4.1 Data partitioning methods 68 4.4.2 Horizontal partitioning of a multi-dimensional dataset 70 4.4.3 Vertical partitioning of a multi-dimensional dataset 73 4.5 Data replication 75 4.6 Query processing parallelism 75 4.6.1 Inter-and intra-operators 76

IX 4.6.2 Exchange operator 77 4.6.3 SQL operator parallelization 78 4.7 Concluding remarks 84 Data intensive applications with MapReduce 85 5.1 Introduction 85 5.2 MapReduce: New parallel computing model in cloud computing.. 86 5.2.1 Dataflow model 86 5.2.2 Two frameworks: GridGain versus Hadoop 89 5.2.3 Communication cost analysis 90 5.3 Distributed data storage underlying MapReduce 92 5.3.1 Google file system 92 5.3.2 Distributed cache memory 93 5.3.3 Data accessing 94 5.4 Large-scale data analysis based on MapReduce 95 5.4.1 Data query languages 96 5.4.2 Data analysis applications 96 5.4.3 Comparison with shared-nothing parallel databases 97 5.5 SimMapReduce: Simulator for modeling MapReduce framework 100 5.5.1 Multi-layer architecture 101 5.5.2 Input and output of simulator 102 5.5.3 Implementation details of simulator 105 5.5.4 Modeling process 107 5.6 Concluding remarks 109 Large-scale multi-dimensional data aggregation 111 6.1 Introduction Ill 6.2 Data organization Ill 6.2.1 Computations in data explorations 114 6.2.2 Multiple group-by query 117 6.3 Choosing a right MapReduce framework 118 6.3.1 Advantages of GridGain 118 6.3.2 Combiner support in Hadoop and GridGain 119 6.3.3 Realizing MapReduce applications with GridGain 119 6.3.4 Workflow analysis of GridGain procedure 120 6.4 Parallelizing single group-by query with MapReduce 122 6.5 Parallelizing multiple group-by query with MapReduce 122 6.5.1 Data partitioning and data placement 123 6.5.2 MapReduce model-based implementation 123 6.5.3 MapCombineReduce model-based implementation 126 6.6 Cost estimation 128 6.6.1 MapReduce model-based implementation 128 6.6.2 MapCombineReduce model-based implementation 131 6.6.3 Comparison of implementations 132 6.7 Concluding remarks 133

X 7 Multi-dimensional data analysis optimization 135 7.1 Introduction 135 7.2 Data-locating based job-scheduling 135 7.2.1 Job-scheduling implementation 136 7.2.2 Two-level scheduling 136 7.2.3 Alternative job-scheduling schemes 137 7.3 Improvements by speed-up measurements 137 7.3.1 Horizontal partitioning 138 7.3.2 Vertical partitioning 141 7.4 Improvements by affecting factors 143 7.4.1 Query selectivity 144 7.4.2 Side effects 144 7.5 Improvement by cost estimation 145 7.5.1 Horizontal partitioning 146 7.5.2 Vertical partitioning 149 7.5.3 Comparison of partitioning 150 7.6 Compressed data structures 151 7.6.1 Data structure description 151 7.6.2 Data structures for storing recordld-list 152 7.6.3 Compressed... data structures for different dimensions 152 7.6.4 Bitmap sparcity and compressing 154 7.7 Concluding remarks 155 8 Real-time scheduling with MapReduce 157 8.1 Introduction 157 8.2 Real-time scheduling problem 158 8.2.1 Real-time task 158 8.2.2 Processing resource 159 8.2.3 Scheduling algorithms 160 8.3 Schedulability test in the cloud datacenter 160 8.3.1 Pseudo-polynomial complexity 161 8.3.2 Polynomial complexity 162 8.3.3 Constant complexity 163 8.4 Utilization bounds for schedulability testing 164 8.4.1 Classical bound 164 8.4.2 Closer periods 165 8.4.3 Harmonic chains 165 8.4.4 Hyperbolic bound 165 8.5 Real-time task scheduling with MapReduce 166 8.5.1 System model 166 8.5.2 MapReduce segmentation 167 8.5.3 Worst pattern for a schedulable task set 168 8.6 Reliability indication methods 174 8.6.1 Reliability indicator 174 8.6.2 Schedulability test conditions 176

xi 8.6.3 Comparison of rate monotonic conditions 177 8.6.4 Comparison of deadline monotonic conditions 178 8.7 Concluding remarks 182 9 Future for cloud computing 183 Bibliography 187 Index 203