J Grid Computing
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions
Jawwad Shamsi · Muhammad Ali Khojaye · Mohammad Ali Qasmi
Received: 6 February 2012 / Accepted: 28 March 2013
Springer Science+Business Media Dordrecht 2013

Abstract Data-intensive systems encompass terabytes to petabytes of data. Such systems require massive storage and intensive computational power in order to execute complex queries and generate timely results. Further, the rate at which this data is being generated induces extensive challenges of data storage, linking, and processing. A data-intensive cloud provides an abstraction of high availability, usability, and efficiency to its users. Underlying this abstraction, however, are stringent requirements and challenges in facilitating scalable and resourceful services through effective physical infrastructure, smart networking solutions, intelligent software tools, and useful software approaches. This paper analyzes the extensive requirements which exist in data-intensive clouds, describes various challenges related to the paradigm, and assesses numerous solutions in meeting these requirements and challenges. It provides a detailed study of the solutions and analyzes their capabilities in meeting the emerging needs of widespread applications.

J. Shamsi (B) · M. A. Khojaye · M. A. Qasmi
Systems Research Laboratory, FAST-National University of Computer and Emerging Sciences, Karachi, Pakistan

Keywords Data-intensive cloud computing · Scalability · Fault tolerance · Heterogeneity · Large scale data management · Cloud data storage

1 Introduction

Massive popularity and wide-scale deployment of the Internet have enormously increased the rate of data generation and computation [45, 67]. This huge growth has also highlighted immense potential for the utilization and analysis of data across a wide set of users and applications. Consequently, unprecedented data-related challenges have emerged.
Consider the example of a simple Internet search engine that ranks documents on the basis of the relative frequency of search terms in its data collection. The search engine could be enhanced if it considered user clicks while obtaining popular results. Similarly, the geographical location of users could be incorporated to increase relevancy. The two enhancements mentioned here may seem plausible; however, considering the massive dataset of Internet documents and the diverse geo-locations of Internet users, they require comprehensive collection, efficient storage and retrieval, extensive linkage, meticulous investigation, and methodical analysis, most importantly in a precise and timely manner. Further, extensive
requirements of meeting availability, scalability, and high performance also exist. The extensive challenges mentioned above are not restricted to search engines. With the emergence of clouds, the notion of computing has incorporated new requirements of providing efficient user access and storage. Further, the notions of availability and scalability are inherent to cloud systems. In addition, for a multi-user system, a cloud needs to fulfill the requirements of privacy and access control. In the data-intensive world we live in, requirements and challenges also vary with applications. For example, an iterative application such as the page-rank computation algorithm requires iterative computation until a point of convergence is reached. In comparison, a streaming application would prefer processing a stream of events in order to provide timely results. In this research, we are motivated by the huge rate at which data is being generated, the massive contribution it has made to different applications, and the enormous potential it possesses in improving the performance of computing systems. These considerations necessitate the following questions: 1) what are the challenges and requirements associated with different data-intensive applications?, and 2) are the computing platforms capable of providing efficient solutions in this paradigm? Through this paper, we investigate these questions. We provide an extensive survey for the academic, research, developer, and industrial communities by exploring the requirements of data-intensive clouds, investigating the existing challenges, and studying the available solutions. In analyzing these issues, we consider a wide range of scenarios, including infrastructure-related problems and platform-related matters. Realizing the significance and wide-scale deployment of Hadoop and MapReduce, the paper also describes various extensions of the Hadoop framework which have been proposed to enhance performance.
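The iterative page-rank computation mentioned above can be illustrated with a minimal power-iteration sketch that repeats until a point of convergence is reached; the tiny link graph, the damping factor of 0.85, and the tolerance are illustrative assumptions, not part of any specific system discussed in this paper.

```python
# Minimal power-iteration sketch of page rank: iterate until the rank
# vector changes by less than a tolerance (the convergence point).
# The three-page link graph and damping factor are illustrative only.

def pagerank(links, damping=0.85, tol=1e-6):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    while True:
        new_rank = {}
        for n in nodes:
            # Sum rank contributions from every page that links to n.
            incoming = sum(rank[m] / len(links[m]) for m in nodes if n in links[m])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        if max(abs(new_rank[n] - rank[n]) for n in nodes) < tol:
            return new_rank  # converged
        rank = new_rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(graph)
```

On a web-scale graph, each such iteration is itself a distributed job, which is why iterative applications place distinctive demands on data-intensive platforms.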
In a recent study, Sakr et al. surveyed large-scale data management approaches in clouds. Our work differs substantially from that survey. First and foremost, we adopt a challenge-centric approach: we conducted an extensive search of the literature and identified requirements and challenges related to data-intensive computing. Specific to each challenge, we describe solutions and analyze their strengths and weaknesses. Second, our approach is more extensive in that we consider many issues related to the physical and infrastructural requirements of data-intensive clouds. These include network constraints, resource-sharing considerations, billing issues, the effect of hardware advancements, data placement matters, and many other related considerations. Third, we also discuss application-specific challenges, such as capabilities for iterative algorithms and suitability for join operations. Considering the large-scale expansion of data-intensive cloud computing, in which many applications and platforms have been utilized, we dedicate a separate section to application-specific enhancements in order to provide an extensive view of application-specific challenges and solutions. Consequently, our work is significant with multiple benefits. For researchers, it provides a comprehensive analysis of the existing work and identifies challenges, whereas for academicians it offers a thorough study of the subject. Our work is also useful for the developer community in understanding the strengths and weaknesses of different solutions. The industrial community could also find our work useful in understanding the requirements and assessing the capabilities of these solutions. The remainder of this paper is organized as follows: Section 2 explains different concepts about data-intensive computing and describes various requirements in the field.
Section 3 elaborates on challenges and solutions, while Section 4 presents application-specific solutions for data-intensive systems. Section 5 concludes the paper with analysis and future directions of research.

2 Data-Intensive Clouds

This section explains background concepts about data-intensive computing. The section begins by
explaining data-intensive computing and cloud computing. It builds upon this discussion to define data-intensive cloud computing and continues upon this definition to mention various requirements and issues associated with the domain.

2.1 Background Information

Data-intensive computing refers to computing over large-scale data. Gorton et al. describe types of applications and research issues for data-intensive systems. Such systems may either be pure data-intensive systems or data/compute-intensive systems. The former devote most of their time to data manipulation or data I/O, whereas in the latter data computation is dominant. Normally, parallelization techniques and high performance computing are adopted to address the challenges related to data/compute-intensive systems. With the growth of data-intensive computing, traditional differences between data/compute-intensive systems and pure data-intensive systems have started to blur, and both are collectively referred to as data-intensive systems. Major research issues for data-intensive systems include the management, handling, fusion, and analysis of data. Often, time-sensitive applications are also deployed on data-intensive systems. The Pacific Northwest National Laboratory has proposed a comprehensive definition: Data-intensive computing is managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies. A wide set of requirements and issues arise when data-intensive applications are deployed on clouds. The cloud must be scalable and available. It should also facilitate huge data analysis and massive input/output operations.
Considering the administrative challenges and the development requirements a cloud should offer, we propose the following definition for data-intensive cloud computing: Data-intensive cloud computing involves the study of both programming techniques and platforms to solve data-intensive tasks, and the management and administration of hardware and software which can facilitate these solutions. Depending upon its usage, a data-intensive cloud could either be deployed as a private cloud supporting users of a specific organization, or as a public cloud providing shared resources to a number of users. A data-intensive cloud entails many challenges and issues. These include data-centric issues, such as implementing efficient algorithms and techniques to store, manage, retrieve, and analyze the data, and communication-centric issues, such as dissemination of information, placement of replicas, data locality, and retrieval of data. Note that issues in the two categories may be interrelated. For instance, data locality often leads to faster execution of tasks. Grossman and Gu discussed varieties of cloud infrastructures for data-intensive computing. Figure 1 illustrates the two architectural models for such a system: a cloud could provide EC2-like instances for data-intensive computing, or it could offer computing platforms (like MapReduce) to its users. In the former case, a user is required to select tools and a platform for computing, and the cloud provider is responsible for storage and computing power. The provider is also liable for replication, fault tolerance, and consistency. In comparison, for platform-based cloud computing, application-specific solutions [20, 132] exist which provide enhanced performance.

Fig. 1 Architecture model of data-intensive cloud computing
In this paper, we mainly focus on the latter category (data-intensive computing platforms), as such platforms specifically address the challenges of and solutions to data-intensive computing. However, throughout the paper we discuss a few infrastructure-related issues, such as effective network utilization and resource sharing, which may well be applied to both types.

2.2 Suitable Systems for Data-Intensive Cloud Computing

In order to comprehend the challenges of data-intensive clouds, it is pertinent to understand the types of systems which can utilize them. In a research study, Abadi discusses the types of data-intensive systems which can be deployed on a cloud. The author compares the requirements of transactional systems and analytical systems. Transactional systems rely on the ACID (Atomicity, Consistency, Isolation, and Durability) guarantees provided by databases. The author mentions that such systems are unlikely to be deployed on a cloud because of the difficulties in facilitating locks, commits, and data partitioning in a shared-nothing architecture. Further, ACID guarantees are difficult to maintain over a cloud system, which is replicated and distributed across multiple geographical locations. Such systems have strong requirements of privacy and trust. For these systems, fault tolerance generally denotes the capability of the system to ensure ACID guarantees in case of a fault. In comparison, analytical systems mostly have a write-once, read-many architecture. For such systems, the requirements for distributed locking and commit are relaxed. They are more suitable for a shared-nothing architecture, as the query load can be divided across multiple hosts. For such systems, fault tolerance is the ability of the system to provide uninterrupted execution of a query. Such systems are therefore more likely to benefit from cloud systems.
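The analytical query pattern described above, in which the query load is divided across the hosts holding the data and the partial results are combined, can be sketched as a simple scatter-gather; the node contents and the query predicate are made-up illustrations, not any particular system's API.

```python
# Illustrative scatter-gather over shared-nothing partitions: each node
# answers the query against its local rows, and a coordinator merges the
# partial results. Node contents and the predicate are made up.

def scatter_gather(partitions, predicate):
    # "Scatter": run the query independently on every node's partition.
    partials = [[row for row in rows if predicate(row)] for rows in partitions]
    # "Gather": merge the partial results into the final answer.
    return [row for part in partials for row in part]

nodes = [
    [{"user": "alice", "clicks": 10}, {"user": "bob", "clicks": 3}],
    [{"user": "carol", "clicks": 7}],
]
hits = scatter_gather(nodes, lambda r: r["clicks"] > 5)
```

Because each partition is queried independently, the loss of one node interrupts only its share of the work, which is the sense in which fault tolerance is defined for analytical systems above.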
Considering the appropriateness of analytical systems for data-intensive applications, two types of software platforms can be used to build data-intensive clouds: (1) parallel databases with a shared-nothing architecture, and (2) NOSQL systems, which are distributed and non-relational data storage systems. In databases relying on a shared-nothing architecture (such as Teradata and Gamma), a table is horizontally divided across multiple nodes. The division can be implemented either in a round-robin manner or through hashing of indexes. Distributing indexes gives the advantages of distributed query processing and large-scale storage. Results of queries from individual nodes are merged and shuffled to produce the final results. These databases have fast retrieval capabilities, which are aided by advancements in indexing such as B-trees. In comparison, NOSQL systems such as MapReduce (Hadoop), MongoDB, and Cassandra do not support a descriptive SQL language for query processing. Storage is normally provided through a distributed store, which spans a large number of machines. The lack of strong consistency in NOSQL (also referred to as MR-like) systems has been debated in the research community. NOSQL systems appear to be inspired by the CAP theorem, which states that out of the three characteristics of consistency, availability, and partition tolerance, only two can be achieved at a time by a distributed system. However, in a blog, Abadi highlighted some potential problems with the CAP theorem. Abadi argued that it is not necessary that consistency be compromised only to achieve availability. Instead, consistency may also be compromised for latency. For instance, Yahoo's PNUTS relaxes consistency (by implementing eventual consistency) and availability in order to achieve low latency.
Similarly, in the case of a network partition, Dynamo DB from Amazon relaxes consistency to achieve availability, whereas under normal operation it gives up consistency in order to decrease latency. NOSQL systems such as MapReduce have also been compared with parallel databases. In a blog, David DeWitt and Michael Stonebraker mentioned the lack of schema as a major limitation of MR-like systems. This implies that retrieval of documents would be slower due to the lack of indexed data.
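The hash-based horizontal partitioning used by shared-nothing parallel databases, as described earlier, can be sketched in a few lines; the number of nodes and the row keys are illustrative assumptions.

```python
import hashlib

# Horizontal partitioning sketch: each row is assigned to one of N nodes
# by hashing its key, spreading both storage and query load. Four nodes
# and string row keys are illustrative assumptions.
NUM_NODES = 4

def node_for(key):
    # Hash the key and map it onto one of the nodes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

table = ["row-%d" % i for i in range(10)]
placement = {key: node_for(key) for key in table}
```

A round-robin scheme would instead assign rows in arrival order; hashing has the advantage that the node holding any given key can be recomputed without a lookup table.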
Later, in a research study, Pavlo et al. compared the two models for data-intensive tasks. The authors argued that the flexible SQL environment and high speed of execution are compelling advantages of parallel databases. In comparison, MapReduce offers ease of installation. In response to the arguments from these two sources, Dean and Ghemawat, the two proponents of the MapReduce system, highlighted heterogeneity and fault tolerance as the two major strengths of MapReduce. The authors also mentioned that MapReduce is powerful enough to compute several complex tasks, such as computing in-links and out-links for page-ranking. Some researchers have proposed the use of MapReduce in conjunction with databases. The authors argued that MR systems are useful for ETL (Extract, Transform, Load) capabilities, whereas the database system could be used for efficient query processing. Other major advantages of MapReduce (Hadoop) over parallel databases are its open-source architecture and ease of installation. MapReduce is also cost-effective compared to parallel databases. The ability to read encrypted and compressed data has also been considered a major requirement for both architectures [1, 110]. Although the original version of MapReduce does not provide these features, possibilities exist in which encrypted data can be read. While scalability, cost-effectiveness, heterogeneity, and fault tolerance have been the characteristics of MapReduce-style frameworks [34, 61, 72], speed of execution and ease of development have been the propelling reasons for shared-nothing databases. Figure 2 illustrates the comparison between the two platforms.

Fig. 2 Types of data-intensive systems

Considering this comparison, many extensions and application-specific solutions have been proposed for MR-like systems. For instance, MR-like systems were initially argued to be restricted to batch processing.
However, many solutions have been proposed to introduce real-time processing or stream processing in the cloud. Wide usage and an open-source architecture have yielded many application-specific solutions. These solutions demonstrate a variety of uses, such as indexing, joins, faster execution, transactional systems, and streaming. A detailed description of these solutions is given in Section 4. We now explain the requirements and expectations for data-intensive clouds.

2.3 Requirements and Expectations of Data-Intensive Clouds

A data-intensive cloud system entails several requirements related to scalability, availability, and elasticity. Further, issues such as infrastructure support, hardware issues, and software platforms are also important. Depending upon the scope of an application and the type of services a cloud provides, these requirements may vary for each application. Note that a data-intensive cloud is different from a traditional cloud, in that the former is capable of processing and managing massive amounts of data. However, in addition to the challenges related to data processing and management, a data-intensive system should also meet the requirements of a traditional cloud system, such as scalability, fault tolerance, and availability. We now describe the significant requirements for data-intensive clouds, stated with respect to data-intensive computing.

1) Scalability A data-intensive cloud should be able to support a large number of users without any noticeable performance degradation. Large scaling
may be achieved through the addition of commodity hardware.

2) Availability and Fault Tolerance The strict requirement of availability is tied to the ability of the system to tolerate faults. Faults could occur at the infrastructure/physical layer, or they could arise at the platform (or application) layer. As mentioned, in analytical systems fault tolerance denotes the capability of the system to facilitate query execution with little interruption. Comparatively, in transactional systems, ACID guarantees must be ensured. Overall, the system should have the ability to sustain both transient failures (such as network congestion, bandwidth limitation, and CPU availability) and persistent failures (such as network outages, power faults, and disk failures).

3) Flexibility and Efficient User Access Mechanism A data-intensive cloud should facilitate a flexible development environment in which desired tasks and queries can be easily implemented. A significant requirement is to facilitate an efficient mechanism for data access. For intensive tasks, the framework should also support parallel and high-performance access and computing methods.

4) Elasticity Elasticity refers to the capability of the cloud to utilize system resources as per its needs and usage. This implies that more capacity can be added to the existing system. The resources may shrink or grow according to the current state of the cloud.

5) Sharing and Effective Resource Utilization Many applications share clouds for their computation. This is specifically true for a private cloud. For instance, the authors of one study mentioned that data is shared between multiple applications at Facebook. Sharing reduces the overhead of data duplication and yields better resource utilization. Efficient and effective mechanisms are needed to facilitate this sharing requirement.

6) Heterogeneous Environment The cloud system should support heterogeneous infrastructure.
A homogeneous configuration is not always possible for data-intensive systems. In such an environment, issues such as differing computation power across cloud machines, varying disk speeds, and networking hardware with dissimilar capacities are not infrequent. Consequently, a cloud may have to encounter varying delays.

7) Data Placement and Data Locality Big data systems have complex requirements of data placement. Issues to be considered include data locality, fast data loading and query processing, efficient storage-space utilization, reduced network overhead, the ability to support various work patterns, and low power. Multiple copies of data sets may be maintained to achieve fault tolerance, load balancing, availability, and data locality. Consistency requirements vary with the type of application being hosted on the cloud. It has also been suggested that data-intensive applications with strong consistency requirements are less likely to be deployed on clouds.

8) Effective Data Handling Fault tolerance should be aided by effective data handling. For instance, many tasks in data-intensive computing are multi-stage. Handling of intermediate data is important for such tasks. A failure in intermediate steps of the workflow should not drastically affect system execution.

9) Effective Storage Mechanism The storage mechanism should facilitate fast and efficient retrieval of documents. Since data is distributed, effective utilization of disks is important.

10) Support for Large Data Sets In a cloud environment, data-intensive systems should provide scalable support for huge datasets. A cloud should be able to execute a large number of queries with only a small latency. Considering the varieties of data-intensive computing, support for characteristics such as huge files and large numbers of small files in a directory is also beneficial.

11) Privacy and Access Control In cloud computing, data is outsourced and stored on cloud servers. With this requirement, issues
of data protection and data privacy arise. Although encryption may be used to protect sensitive data, it induces the additional cost of encryption and decryption in the system.

12) Billing For a public cloud, an efficient billing mechanism is needed, as it covers the cost of cloud operations. A user may be charged on the basis of three components: (i) data storage, (ii) data access, and (iii) data computation. The inclusion of these components in the billing may vary depending upon the type of service a provider offers to its customers.

13) Power Efficiency Data-intensive clusters consume enormous amounts of electrical power. Low-power solutions save infrastructural cost and ease cooling requirements. In a power-constrained environment, such solutions could also lead to enhanced capacity and increased computational power.

14) Efficient Network Setup Cloud providers use over-provisioning for profit maximization. In a multi-user cloud environment, network problems such as congestion, bandwidth limitation, and excessive network delays could be induced. Problems such as high packet loss and TCP Incast could also arise. A data-intensive cloud should be able to encounter these challenges. Effective bandwidth utilization, efficient downloading and uploading, and low-latency data access are critical requirements.

15) Efficiency A data-intensive computing system must be efficient in fulfilling its core tasks. Intensive tasks require multi-stage pipelined execution, intelligent workflows, and effective distribution and retrieval capabilities. These requirements collectively determine the efficiency of the system. With the diversity in data-intensive computing, algorithms and techniques also vary for each application.
For instance, some algorithms (such as page-rank or N-body computation) require optimization for iterative computation. The above set of requirements provides a comprehensive view of the needs and objectives of a data-intensive system. Meeting these requirements is essential for improving the efficacy and applicability of a system.

3 Challenges and Solutions

This section discusses and elaborates on various challenges and solutions related to data-intensive cloud computing. The discussion is motivated by the requirements and expectations mentioned in the previous section. However, in mentioning these challenges and their solutions, we remain focused on the data-intensive paradigm. For instance, we do not elaborate on issues such as backup power to promote availability, as they are considered outside the scope of the paper. Table 1 presents a summary of the challenges and related solutions for data-intensive computing.

3.1 Scalability

Challenges A cloud should be well-equipped to provide scalability. While adding physical resources contributes toward increasing scalability, effective management and utilization of resources and an appropriate mechanism of task mapping are critical in maintaining it.

Solutions Scalability is the core requirement for data-intensive clouds. In order to support this requirement, there are numerous considerations related to file systems, programming platforms, storage and database systems, and data warehouse systems. We now describe the scalable solutions which exist at the file system, platform, and database and storage layers (Fig. 3). Note that data warehousing-based solutions are described in Section 3.4 under flexibility and efficient user access.

1) File System Scalability of a file system determines its capability to store and process big data. Distributed file systems such as the Google File System (GFS) and
Table 1 Challenges and solutions of data-intensive cloud computing

1. Scalability: MapReduce, Hadoop, Data warehousing and analytics infrastructure at Facebook, Hive, Cassandra, BigTable, GFS, MongoDB, Dynamo DB, Hbase, Pig
2. Availability, fault detection, and fault tolerance: Globally Distributed Storage Systems, MapReduce, HiTune, CloudSense, Riak
3. Flexibility and efficient user access: Dryad, DryadLINQ, Pig Latin, All-Pairs, Sawzall
4. Elasticity: ElasTras, Zephyr
5. Sharing of a cluster for multiple platforms: Otus, Mesos, Google compute clusters, ARIA, delay scheduling, disk head scheduling [120, 121]
6. Heterogeneous system: LATE, MR-Predict, heterogeneity-aware task distribution
7. Data handling, locality, and placement: Volley, RCFile, automatic replication of intermediate data
8. Effective storage mechanism: pwalrus, Megastore, DiskReduce
9. Privacy and access control: Delegation of RESTful Resources in storage cloud, Airavat
10. Billing: Exertion-based billing for cloud
11. Power efficiency: FAWN, MAR, BEEMR, CS and AIS
12. Network problems: TCP RTO for Incast, WhyHigh
13. Efficiency: Section 4

the Hadoop Distributed File System (HDFS) have provided significant solutions for data-intensive systems. GFS and HDFS have many similarities, the latter having been inspired by the former, although GFS uses a different set of naming conventions. Both store data in large blocks of 64 MB, which allows low seek time and increased efficiency. In HDFS, specific nodes (called data nodes) store all the data. Meta-information is stored on the name node, which provides lookup services. Each block is by default replicated three times on data nodes. High replication improves availability and data locality, a concept in which a task is preferably executed near the location of its data. This reduces network bottlenecks.

Fig. 3 Scalability at different layers
2) Programming Platforms MapReduce has been the most popular programming platform for data-intensive computing. It was initially proposed by Google and later adopted by the open-source community as the Hadoop project. Hadoop provides execution of MapReduce tasks over a cluster of machines based on commodity hardware. The core functionality of the framework is provided through its two phases, Map and Reduce. In both phases, a notion of <key, value> pairs is used for input and output operations. In the Map phase, an intensive task is divided into a large number of smaller, independent, and identical map tasks. Each map task is executed independently on one of the available nodes in the cluster. While scheduling map tasks on a cluster, the MapReduce framework strives to ensure data locality. The Reduce phase involves aggregation of <key, value> pairs from all the map tasks over the network. In that, all the <key, value> pairs emitted in the Map phase are merged and delivered to the node which executes the reducer. Many organizations, such as Facebook, Yahoo, and Adobe, have implemented Hadoop-based solutions. Google has implemented a proprietary version of MapReduce, which is used for the generation of data for the production web search service, sorting, data mining, machine learning, and many other systems. Sector/Sphere is a solution for distributed data-intensive computing, in which Sector is a file-based distributed file system similar to GFS/HDFS, whereas Sphere is a distributed programming platform that utilizes Sector. Sphere is tightly coupled with Sector, in that Sphere applications can provide feedback to Sector to improve data locality. Sphere provides greater flexibility by allowing arbitrary UDFs (User-Defined Functions). Sector can also support data processing at various levels of granularity. The authors of Sector reported 2-4 times faster execution compared to Hadoop; however, the scalability of Sector/Sphere has not been discussed.

3) Distributed Storage and Database Systems BigTable is a scalable data storage platform from Google which stores data from many Google applications. It spans thousands of commodity servers, with sizes in petabytes. The data is stored in the form of a sparse, multi-dimensional, sorted map, in which the data is indexed by a row key, a column key, and a timestamp. BigTable is built upon GFS, which is used to store data on the servers. Hbase is an open-source version of the BigTable distributed storage system. It runs on top of HDFS and provides BigTable-like capabilities for the management of large volumes of structured data. Hbase is written in Java to achieve platform independence. Because of its portability and its capability to scale to a very large size, it is used in many data-center applications, including Facebook and Twitter. Usually, HBase and HDFS are deployed in the same cluster to improve data locality. HBase consists of three major components: HBaseMaster, HRegionServer, and HBaseClient.
The master is responsible for assigning regions to region servers and for recovery. In addition, the master manages administrative tasks such as the resizing of regions and the replication of data among different region servers. The client is responsible for finding the region servers to which it should send read/write requests. The region server manages client read and write requests. It communicates with the master to get a list of regions to serve and to tell the master that it is alive. HBase is planned to be supported by a query language, HBQL. Cassandra is a distributed data management system implemented by Facebook. It allows users to manage very large datasets, distributed over a number of commodity machines, with no single point of failure. Cassandra is structured as a key-value store. It adopts a column-oriented approach in which data is stored as sections of databases such that keys are mapped to various values grouped by column families. Although the number of column families is defined during the creation of a database, columns can be dynamically added to or removed from a family. Cassandra also incorporates a row-oriented approach, in that the values from a column family for each key are stored together. The combination of column-oriented and row-oriented approaches makes Cassandra a hybrid model for data storage and management. Cassandra uses a peer-to-peer model, which means that there is no single master; all nodes work as masters. This results in high scalability in both read and write operations. MongoDB is an open-source, document-based database management system designed for storing, retrieving, and managing document-oriented or semi-structured data. It provides support for many features such as dynamic queries, secondary indexes, fast atomic updates, and replication with automatic failover. Replication is provided via a topology known as a replica set.
Replica sets distribute data for redundancy and automate failover in the event of outages. Most replica sets consist of one primary node and one or more secondary nodes. Clients direct all writes to the primary node, while the secondary nodes are read-only and replicate from the primary asynchronously. If the primary node fails, the cluster picks a secondary node and automatically promotes it to primary, thereby supporting automated failover. However, when the
earlier primary comes back online, it rejoins as a secondary. MongoDB can also scale horizontally via a range-based partitioning mechanism known as auto-sharding, through which data is automatically distributed and managed across nodes.

Amazon Dynamo DB is another highly available and scalable distributed data store, built for Amazon's AWS cloud platform. Dynamo is a key-value storage system that has been successful in handling server failures, data-center failures, and network partitions. It provides the desired levels of availability and performance to the user. One of Dynamo's notable features is incremental scalability, which allows service owners to scale up and down depending on their current request load. To achieve high availability, Dynamo sacrifices consistency. It uses consistent hashing to distribute load among multiple storage hosts. Object versioning is used to maintain multiple versions of an object: management begins with a generic object, and subsequent versions reflect its changes. The Dynamo system is completely decentralized; adding or removing storage nodes does not require any manual partitioning or redistribution. Dynamo has provided scalable storage services for S3 and other related AWS services.

Riak is a distributed and scalable NoSQL database that utilizes MapReduce as a processing platform. Its main strength is its distributed architecture, which avoids a single point of failure. It stores data in buckets, which are similar to tables in a relational database. Riak is hosted on a cluster of machines, with the data distributed among the nodes. Each node hosts a set of virtual nodes (vnodes), and each virtual node is responsible for part of the data storage. In Riak, the storage location of a datum is computed using a 160-bit binary hash of its bucket and key pair. Riak incorporates eventual consistency and high fault tolerance.
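The consistent hashing that Dynamo and Riak rely on can be sketched as follows. The ring class and node names are illustrative, though the 160-bit SHA-1 hash mirrors Riak's choice:

```python
# Sketch of consistent hashing as used by Dynamo and Riak: keys and nodes
# are hashed onto the same circular space, and a key is stored on the first
# nodes found clockwise from its position. Illustrative only.
import hashlib
from bisect import bisect_right


def ring_hash(value):
    # Riak hashes the bucket/key pair with a 160-bit function (SHA-1).
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def preference_list(self, key):
        """First `replicas` nodes clockwise from the key's ring position."""
        points = [p for p, _ in self.ring]
        i = bisect_right(points, ring_hash(key)) % len(self.ring)
        return [self.ring[(i + k) % len(self.ring)][1]
                for k in range(self.replicas)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.preference_list("users/alice")
```

The incremental scalability noted above falls out of this design: adding or removing a node remaps only the keys adjacent to it on the ring, so no global repartitioning is needed.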
Each key has designated primary and secondary vnodes. Riak also replicates data, with the replication factor managed by the user.

3.2 Availability, Fault Detection, and Fault Tolerance

Challenges
For big-data clouds, faults are the norm, and failures and crashes can occur at any time. Although in a distributed environment MapReduce (and similar systems) provides high fault tolerance and high availability, restarting through speculative execution only those tasks which do not respond within a reasonable amount of time, fault detection and determining the cause of failures remain an issue due to the large size of the cluster. It has also been argued that time-based detection of faults in MapReduce systems is difficult, because execution depends on a number of factors, including the size of the cluster, the size of the input, and the type of the job.

Solutions
One study presents a fault and availability analysis of Google's cloud storage system. The system consists of three layers: BigTable, GFS, and the Linux file system. Availability and fault tolerance at each layer are significant in promoting the availability of the cloud. The authors note that a node can become unavailable for many reasons, including overload of the node or network failure, for example when the monitoring system stops receiving responses to heartbeat messages. However, only 10 % of failures lasted more than 15 minutes. The authors also argue that transient failures do not have a drastic impact on the availability of the cloud, thanks to high replication. In addition, they observed that most non-transient failures happen in bursts caused by rack failures. Analysis of past failures allowed the authors to develop analytical models for future availability and for choices of data placement and replication strategies.

Kahuna is a fault detection tool based on detecting performance problems.
The idea is that under normal conditions MapReduce nodes tend to perform symmetrically, so a node that behaves differently from its peers is a likely source of performance problems. Similarity is detected by observing characteristics such
as CPU usage, network traffic, and completion times of map tasks. While the underlying theme of Kahuna seems justified, its applicability in a heterogeneous environment remains to be seen.

HiTune is a dataflow-based performance analysis tool from Intel, focused on analyzing cloud runtime behavior. The idea is that runtime analysis can be used to detect failures and improve system performance. HiTune uses trackers installed on each node of the cloud. Each tracker monitors its node and sends characteristics (such as CPU cycles and disk bandwidth) to an aggregation engine, which links the information with the help of an analysis engine. In this way a flow-of-execution plan is constructed, which can help diagnose performance issues and guide system improvements. The authors describe three test cases from Intel clusters for performance tuning on Hadoop, in which problems related to Hadoop scheduling, application hotspots, and slow disks were detected and rectified. HiTune has been analyzed extensively: processor micro-architecture events and power-state behaviors of Hadoop jobs have been analyzed using its dataflow model, and it has also been applied to Hive by extending the original Hadoop dataflow model with additional phases and stages.

3.3 Flexibility and Efficient User Access

Challenges
Although MapReduce has been used extensively for data-intensive applications, it offers a rigid programming environment in which every task must be converted into map and reduce tasks.

Solutions
Dryad is a distributed data processing platform from Microsoft which supports large-scale data operations over thousands of nodes. It provides improved flexibility compared to MapReduce. The Dryad framework represents each job as a Directed Acyclic Graph (DAG) in which the vertices represent programs or tasks and the edges correspond to communication between them.
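A Dryad-style job graph can be represented minimally as a DAG whose topological order yields a valid execution schedule. The vertex names and the use of Python's graphlib are purely illustrative, since Dryad itself is not a Python system:

```python
# Minimal sketch of a Dryad-style job: a directed acyclic graph whose
# vertices are tasks and whose edges are data channels between them.
# Any topological order of the graph is a valid sequential schedule.
from graphlib import TopologicalSorter  # Python 3.9+

# vertex -> set of vertices it depends on (its input channels)
job = {
    "read-a": set(),
    "read-b": set(),
    "join": {"read-a", "read-b"},
    "aggregate": {"join"},
    "output": {"aggregate"},
}

schedule = list(TopologicalSorter(job).static_order())
```

Vertices with no unfinished predecessors (here, the two reads) can run in parallel; that is the parallelism the Dryad scheduler exploits when it maps the graph onto machines.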
Taking data locality and resource availability into account, the Dryad execution framework automatically maps the graph onto physical resources. Dryad is supported by DryadLINQ, a high-level language for specifying tasks. Much of Dryad's simplicity (in its scheduler and fault-tolerance model) stems from the assumption that vertices are deterministic. If an application has non-deterministic vertices, it must guarantee that every terminating execution produces an output that some failure-free execution could have generated; in the general case, where vertices can produce side effects, this may be very difficult to ensure.

Pig and Hive [58, 115] are data warehousing systems built on top of Hadoop. They allow queries and analysis over large datasets stored on a Hadoop-compatible file system. Pig uses a scripting language, Pig Latin, which frees the programmer from writing MapReduce tasks; instead, these tasks are generated by the system from the script. In comparison, Hive uses a SQL-like declarative language known as HiveQL, which also allows a user to plug in custom MapReduce functions where applicable. HiveQL is compiled into MapReduce jobs which are executed in parallel on Hadoop. Data is stored in tables, where each table consists of a number of rows and a fixed number of columns. A type is associated with each column, which can be either primitive (such as integer, float, or string) or complex (such as arrays, lists, or structs). Similar efforts have been made by Moretti et al., who provide a high-level programming abstraction for the All-Pairs problem. The idea is to free the developer from issues such as resource sharing, management, and parallelism, and to provide an efficient solution for a popular data-intensive problem.

Sawzall is a high-performance computing system from Google, motivated by the need to provide an easy interface on a distributed cluster environment such as MapReduce.
For a very large dataset distributed over hundreds or thousands of machines, a relational database approach to querying and analysis is not feasible. Sawzall
exploits the inherent parallelism of such systems through a procedural language. Computation is performed in two phases: in the filtering phase, each host executes the query on the portion of the dataset stored on it; the results from each host are then collected in the aggregation phase, which combines them and stores them in a file. Although Sawzall is useful only for tasks that are associative and commutative, it provides a simple and powerful platform by masking the complexities of parallel programming.

3.4 Elasticity

Challenges
Data-intensive clouds should be capable of scaling according to the state of the system. That is, during moments of high demand and flash crowds the cloud should scale out to meet the need; similarly, during periods of low usage, the cloud should shrink. These adjustments are supported through virtualization, in which Virtual Machines (VMs) are migrated from one physical machine to another to support resource provisioning and elastic load balancing. While such solutions are established at the infrastructure layer, challenges arise at the application layer (or the data storage layer) due to the possibility of service interruption during live migration. Further, scaling out implies partitioning the database, and query processing during migration is also likely to be affected. The intensity of this challenge grows with multi-tenancy, a promising feature of data-intensive clouds.

Solutions
ElasTras is a transactional distributed data store for cloud systems. It provides elasticity through transaction managers that can allocate and de-allocate resources on demand. Zephyr adds to the capabilities of ElasTras by incorporating live migration. It minimizes service interruption by allowing transactions at the source and the destination simultaneously. The migration process begins with the transfer of metadata to the destination.
Once the metadata transfer is complete, new transactions are initiated at the destination while existing transactions finish at the source. The work proposed in Zephyr is important, as live migration is significant for providing elasticity; however, techniques for load balancing and affinity-aware destination selection still need to be incorporated. Note that infrastructure-layer solutions are not discussed here, as they are considered outside the scope of this paper.

3.5 Sharing of a Cluster Among Multiple Platforms

Sharing a cluster induces multiple challenges concerning the sharing of data, hardware resources, and network resources. The severity of these challenges increases if resource constraints and timing requirements must also be satisfied.

1) Understanding Resource Requirements

Challenges
For data-intensive systems such as Hadoop and Dryad, understanding resource requirements and usage is important. This is especially true when batch jobs are executed continuously over different datasets. Understanding resource usage and attribution in such systems provides detailed information about the requirements of a cluster and enables performance monitoring for the applications being executed.

Solution
Otus is a resource attribution tool which monitors the behavior of data-intensive applications in a cluster. It observes events from different layers of the software stack, such as the operating system and MapReduce, to infer resource utilization and relate it to the performance of service components.

2) Data and Resource Sharing

Challenges
While multiple platforms (such as Dryad, Hadoop, and MPI) exist for data-intensive computing, each has its own strengths, and no single platform is efficient and optimal for all data-intensive applications. It has been reported that, at Facebook and Yahoo, scenarios exist in which users may wish to build multiple clusters for their usage.
A simple approach is to set up a separate cluster for each application and transfer the data into each. However, this technique is inefficient, as it requires data duplication.
Solution
Mesos provides the capability to share a single cluster among multiple frameworks. It acts as an intermediary between the cluster and each framework by offering resources to the frameworks; a framework accepts the resources of its choice and is responsible for scheduling its tasks on them. The architecture of Mesos is simple: it requires a single master for the whole cluster and a slave on each node. Slave nodes communicate with the master and offer resources to multiple frameworks, while the master performs coordination and resource scheduling. A major limitation of Mesos is that it requires porting for each framework. Further, the centralized master could become a single point of failure, and under high scalability requirements this may lead to poor performance.

3) Meeting Resource and Timing Constraints

Challenges
In a shared environment, users may compete for resources to meet their timing deadlines. In such a scenario, resource scheduling plays a critical part in meeting an application's expectations. Consider Hadoop: its fair scheduler ensures that each user gets a minimum number of resources for task execution, but it provides no assurance that a user's task execution requirements are met in a shared environment. In addition, there is an inherent conflict between fairness in scheduling and data locality. In large clusters, tasks complete at such a high rate that resources can be reassigned to new jobs on a timescale much smaller than job durations. However, a strict implementation of fair sharing compromises locality, because the job to be scheduled next according to fairness may not have data on the nodes that are currently free. Similarly, for intensive tasks, assigning resources so that task placement constraints are satisfied is important. Sharma et al.
argue that for long-running jobs, reducing delays in task assignment is significant. They performed a study on Google clusters, identified two major types of constraints (hardware architecture and kernel version), and observed that such constraints can increase task assignment delays by a factor of two to six.

Solutions
ARIA (Automatic Resource Inference and Allocation) addresses this problem with a scheduler that estimates the number of map and reduce tasks required to meet a soft guarantee for a user. The framework builds a profile for a new job by analyzing its different phases (map, reduce, shuffle, and sort). Based on this profile and the user's service-level expectations, task execution parameters are estimated for subsequent jobs. The framework also incorporates a scheduler which determines the order of jobs and the resources required to execute them. The proposed model was designed for scenarios without node failures; it needs to be extended and evaluated for cases that incorporate failures.

Another proposal is a delay scheduling algorithm which improves locality at the expense of strict fairness. The idea is that if the job at the head of the queue cannot launch a local task on a free node, it waits briefly while other jobs launch tasks instead. The algorithm thus temporarily relaxes fairness by asking jobs to wait for a scheduling opportunity on a node with local data. Experiments reveal that a very small amount of waiting is enough to bring locality close to 100 %. Delay scheduling performs well in typical Hadoop workloads because Hadoop tasks are short relative to jobs, and because each data block can be read from multiple locations.
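The delay scheduling loop described above can be sketched as follows; the job and task representation and the skip-counter threshold are illustrative simplifications of the published algorithm:

```python
# Sketch of delay scheduling: when the head-of-line job has no task whose
# input is local to the free node, skip it (it "waits") up to a threshold,
# after which fairness wins and it runs a non-local task. Illustrative only.
def pick_task(free_node, jobs, max_skips=3):
    """jobs: list of dicts ordered by fairness priority; each job has
    'pending' tasks tagged with the nodes holding their input data."""
    for job in jobs:
        local = [t for t in job["pending"] if free_node in t["input_nodes"]]
        if local:
            job["skips"] = 0          # launched a local task; reset the wait
            return local[0]
        job["skips"] = job.get("skips", 0) + 1
        if job["skips"] > max_skips and job["pending"]:
            # waited long enough: give up locality rather than starve the job
            return job["pending"][0]
    return None


jobs = [
    {"pending": [{"id": "t1", "input_nodes": {"node-2"}}]},  # head of line, no local data
    {"pending": [{"id": "t2", "input_nodes": {"node-1"}}]},  # has data on the free node
]
chosen = pick_task("node-1", jobs)
```

Here the head-of-line job is skipped once (its data lives elsewhere) and the second job launches a local task, which is exactly the fairness-for-locality trade the algorithm makes.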
However, its effectiveness needs to be evaluated for different behaviors, such as longer tasks or fewer data blocks.

4) Disk Head Scheduling

Challenges
In a shared environment, read requests may be issued by multiple workloads. Under such a scenario, disk scheduling can play a significant role in users' access latencies. For instance, interdependence between different
datasets or interference between different workloads can reduce disk I/O speed. Challenges also arise if the data is striped across multiple servers: if disk heads are not co-scheduled, a client may have to wait until it is able to read the data from all the servers. The problem becomes severe considering that access patterns are not predefined and multiple users access the cluster at the same time.

Solution
In [120, 121], the authors propose a disk scheduling scheme which co-schedules data access across all servers in the cluster. The scheme also provides performance insulation, so that the performance of individual workloads does not degrade when they share a cluster. It minimizes interactions between datasets through time-slicing of disk heads and slack assignment.

3.6 Heterogeneous Environment

1) Timely Completion of Slow Tasks

Challenges
In heterogeneous systems, the execution speed of tasks varies because hardware resources, such as CPU speed, disk access speed and bandwidth, and networking equipment, vary throughout the cluster. In such a scenario, Hadoop's strategy of launching a redundant task in response to a slow node (speculative execution) is ineffective, as it is based on a heuristic in which slow tasks are detected by comparing a task's completion status with the average task execution in the system.

Solution
For such systems, Zaharia et al. proposed the scheduling algorithm LATE (Longest Approximate Time to End), which identifies slow tasks and prioritizes them according to their expected completion time. Slow tasks are re-executed on fast available nodes to prevent thrashing and promote timely completion of jobs. The authors show that LATE can improve the response time of MapReduce jobs by a factor of two in large clusters on EC2. They also evaluated the algorithm on different jobs, such as Sort, Grep, and WordCount.
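LATE's core heuristic, estimating each task's time to end from its progress rate and speculating on the task expected to finish last, can be sketched as follows. This is a simplification: the real algorithm also caps the number of speculative tasks and applies progress-rate thresholds:

```python
# Sketch of LATE's ranking heuristic: a task's progress rate is
# progress/elapsed, and its estimated time to end is the remaining
# work divided by that rate. Speculate on the task expected to finish
# farthest in the future. Simplified; thresholds and caps omitted.
def time_to_end(progress, elapsed):
    """progress in (0, 1]; elapsed in seconds."""
    rate = progress / elapsed
    return (1.0 - progress) / rate


def pick_speculative(tasks):
    # tasks: list of (task_id, progress, elapsed_seconds)
    return max(tasks, key=lambda t: time_to_end(t[1], t[2]))[0]


tasks = [("t1", 0.9, 90), ("t2", 0.2, 80), ("t3", 0.5, 60)]
slowest = pick_speculative(tasks)
```

Note that t2 is chosen even though t1 has been running longer: what matters is the low progress rate, which is what makes LATE robust on heterogeneous nodes where "slower than average" is the norm rather than a fault.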
LATE is more effective for Grep and Sort than for WordCount, because in Grep and Sort the reducers perform more of the work. In jobs where reducers do more work and maps are a smaller fraction of the total time, LATE works more efficiently than Hadoop's scheduler.

2) Scheduling Workloads

Challenges
In a heterogeneous cluster, tasks execute at differing speeds. I/O-bound jobs spend more time reading and writing data, whereas CPU-bound jobs rely on the CPU for completion. In such a scenario, it is important that workloads are distributed according to the processing capabilities of the nodes.

Solutions
MR-Predict is focused on this challenge. It divides MapReduce workloads into three categories based on their I/O and CPU load. For any new task, the workload type is predicted by the MR-Predict framework and the task is handled accordingly, with a separate queue maintained for each category of workload.

3) Heterogeneity-Aware Task Distribution

Challenges
In a heterogeneous environment where processing speeds vary among nodes, high-processing nodes are expected to complete more tasks. In such a scenario, distributing data equally in order to provide data locality is likely to create a network bottleneck. Heterogeneous systems therefore require effective data placement strategies to ensure efficient task scheduling.

Solution
Xie proposed that, in order to distribute tasks according to the capabilities of the nodes in a heterogeneous cluster, data should be made available to high-processing nodes so that tasks can be readily assigned to them and data transfer time is reduced. As a solution, data locality is tied to the processing capability of each node, such that more data is stored on nodes with higher processing speed. Thus, when high-processing nodes complete their tasks, additional tasks can be assigned to them without incurring further delay.
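A minimal sketch of this capability-proportional placement, assuming relative node speeds are known, might look like the following; it is illustrative, not Xie's implementation:

```python
# Sketch of heterogeneity-aware placement: data blocks are assigned to
# nodes in proportion to their measured processing speed, so faster nodes
# hold more local data and can absorb more tasks without transfers.
def place_blocks(num_blocks, node_speeds):
    """node_speeds: dict of node -> relative processing speed."""
    total = sum(node_speeds.values())
    placement = {n: round(num_blocks * s / total)
                 for n, s in node_speeds.items()}
    # fix rounding drift so that every block is assigned exactly once
    drift = num_blocks - sum(placement.values())
    fastest = max(node_speeds, key=node_speeds.get)
    placement[fastest] += drift
    return placement


plan = place_blocks(100, {"fast": 3.0, "medium": 2.0, "slow": 1.0})
```

With a 3:2:1 speed ratio, the fast node receives half of the blocks, which matches the intuition that it will also complete about half of the tasks.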
A potential problem with this technique is its assumption that high-processing nodes have
sufficient storage available to store more data. In addition, processing speeds in the cluster may vary over time due to multi-tenancy.

3.7 Data Handling, Locality, and Placement

Data placement refers to deciding where data is stored. It is often tied to data locality, the concept of locating data in close proximity to the task that executes on it, which reduces network latency and the amount of network traffic. This strategy has been well adopted in MapReduce to reduce network bottlenecks. A naive approach to locating data close to the user is to replicate copies near the user's location. However, several considerations call for an improved strategy:

1) User Mobility. For users with very high mobility, it is difficult to determine a default location.

2) Data Sharing and Dependence. In a cloud environment, data is often dependent or shared. Migrating one dataset could provoke undesired operations on other related datasets, and migration may also introduce consistency issues across data centers.

3) Bandwidth Limitations. Migrating a dataset can be costly if bandwidth is limited. Further, migrating a dataset to the nearest location may affect the bandwidth available to other users.

4) Resource Constraints and Over-Provisioning. In a cloud environment, data centers are over-provisioned for profitability. Migration of data may not only disturb the cost model of the cloud but also affect the availability of resources in data centers.

1) Data Placement

Challenges
For data-intensive applications, the placement of data-analysis servers is also critical. If data analysis is performed on the servers dedicated to the application, it may affect the application's performance. Conversely, deploying separate infrastructure induces the additional cost of physical hardware and brings up the challenges of replication and consistency.
Additionally, network latency should also be considered.

Solution
Volley is an automatic data placement tool from Microsoft. Cloud administrators specify the locations of data centers and the cost and billing model as input to the system. They also specify the replication model, which describes the number of replicas at each location. As a third criterion, administrators state their preference between migration cost and better performance, where performance is measured through user-perceived latencies and inter-data-center latency. Volley takes user access logs as input; these contain user IP addresses, descriptions of the data items accessed, and the structure of requests. Considering all of the above, Volley makes decisions about the migration of datasets. Volley's migration model covers many aspects; however, it would be important to observe its performance in a dynamic system where the relationships and dependencies between datasets are not static.

2) Fast Data Loading and Query Processing

Challenges
Data placement decisions are also motivated by the need for fast data loading and fast query processing. For big-data systems, row-based structures are not efficient, because unneeded columns must be read; the issue becomes severe if a row spans multiple systems. Similarly, column-based structures can cause high network traffic. Therefore, storing data effectively in big-data systems remains a challenge.

Solution
RCFile is a data placement structure built on top of Hadoop. Its authors identify four important requirements for a data placement structure: 1) fast data loading, 2) fast query processing, 3) highly efficient storage utilization, and 4) strong adaptivity to dynamic workload patterns. RCFile is designed to meet these requirements. In an RCFile system, a table is partitioned into row groups, and within each row group each column is stored separately.
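The RCFile layout, horizontal partitioning into row groups with column-wise storage inside each group, can be sketched as follows (illustrative; the actual RCFile format also compresses each column run):

```python
# Sketch of the RCFile placement: the table is cut horizontally into row
# groups, and within each row group values are stored column by column,
# so a query touching few columns reads only those columns' runs.
def to_rcfile(rows, columns, row_group_size):
    groups = []
    for start in range(0, len(rows), row_group_size):
        group = rows[start:start + row_group_size]
        # column-wise storage within the row group
        groups.append({c: [r[c] for r in group] for c in columns})
    return groups


rows = [{"id": i, "url": f"u{i}", "clicks": i * 10} for i in range(4)]
groups = to_rcfile(rows, ["id", "url", "clicks"], row_group_size=2)
# a single-column scan touches only that column's run in each group
clicks = [v for g in groups for v in g["clicks"]]
```

This hybrid addresses both problems named above: unlike a pure row store, a narrow query does not read unneeded columns; unlike a pure column store, all columns of a row live in the same row group on the same node, avoiding cross-node row reconstruction.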
Within each row group, data is stored in
compressed form to reduce the cost of network transfer. A flexible row-group size is allowed in order to meet the challenge of efficient storage utilization. The scheme has been incorporated into the two data warehouse solutions Pig and Hive.

3) Replicating Intermediate Data

Challenges
For many data-intensive applications, the flow of intermediate data is significant. Many platforms, including MapReduce, Dryad, and Pig, generate intermediate data: data produced in one stage of a job and consumed by another. This data is normally temporary, yet it can become a bottleneck that affects performance due to bandwidth constraints or disk failures. Further, loss of intermediate data means that the task which generated it must be re-executed, while all tasks relying on it are halted.

Solution
It has been suggested that automatic replication of intermediate data can reduce the effect of these anticipated failures and minimize cascaded re-execution. For efficiency, the authors suggest using background jobs for data replication. However, this replication technique is likely to increase the cost of operation.

3.8 Effective Storage Mechanisms

1) Parallel Access

Challenges
A major issue in data-intensive computing is providing a storage mechanism which facilitates high-performance computing and parallel access. Parallel file systems such as PVFS and GPFS meet these requirements by providing a high degree of concurrency and ease of access for reads and writes. However, the user interface of these systems is restricted and requires additional administrative work for ease of access.

Solution
pwalrus is a system which is motivated to fill this gap.
Its authors argue that cloud-based data-intensive computing needs a storage system which both facilitates the cloud's access mechanisms and offers services for data-intensive computing by allowing random-access reads and writes. A pwalrus deployment consists of a number of Amazon S3 servers, all backed by the same parallel file system. All the servers have the same view of the S3 storage, allowing users to connect to any of them. In addition to accessing data through S3 objects, users can access it directly through the parallel file system, and a mapping between S3 objects and files is available to users. To facilitate this mapping, pwalrus stores additional configuration files. A major limitation of pwalrus is that simultaneous use of public and private storage services is not possible; issues such as making writes to S3 objects and files atomic with respect to each other are still under consideration.

2) Low Latency for Storage Systems

Challenges
While data-intensive applications need to be scalable, interactive systems must also provide low latency while meeting scalability requirements. At the same time, an interactive system should be consistent and highly available.

Solution
Megastore is motivated to meet these requirements for interactive systems. It combines the ACID properties of relational databases with the scalability semantics of NoSQL databases. In Megastore, data is partitioned such that ACID semantics are ensured within a partition, while consistency guarantees are limited across partitions. Megastore has been deployed for a variety of Google applications.

3) Saving Disk Space

Challenges
Since data-intensive systems involve massive amounts of data, effective utilization of disk space is significant for resource utilization. However, a high replication factor leads to high disk usage. For instance, in HDFS each block is replicated three times for fault tolerance.
This amounts to 200 % extra disk usage.

Solutions
DiskReduce is focused on reducing this overhead. Its authors proposed and implemented a RAID-based replication mechanism,
which reduces disk usage by 10 to 25 %. In DiskReduce, a background process replaces copies of blocks with a lower-overhead RAID encoding; for each encoded block, the corresponding extra HDFS copy is removed. The current implementation supports RAID-5 and RAID-6 encodings. While the effort substantially reduces overhead, for time-sensitive applications the encoding and decoding process could add time and lower performance.

In another work, the authors propose data compression to increase the I/O performance of Hadoop. Applying compression in a MapReduce job improves both time and energy efficiency, and it also improves the efficiency of network bandwidth and disk space. The authors analyze how compression can improve performance and energy efficiency for MapReduce workloads: for read-heavy text data, compression provides substantial energy savings, and for highly compressible data the savings are even higher.

3.9 Privacy and Access Control

1) Access Control

Challenges
In a cloud storage system serving multiple users, delegation of rights between users remains an issue, since some resources will likely need to be accessible to only a limited set of users. Large storage systems require interactions between many users, which introduces additional challenges in the procurement and management of access controls; in such systems, chained servicing of ACLs becomes ineffective due to the involvement of multiple users. In addition, large storage systems may require dynamic creation of objects and interaction between many users and objects. With data outsourcing and replication, there are additional requirements for maintaining the sovereignty of data.
Sovereignty implies that data and its replicated copies are stored only at locations that do not violate a specified policy; that is, data is stored only where it is allowed to be stored. The problem is challenging because cloud providers may, intentionally or unintentionally, replicate copies at locations that are financially or administratively convenient for them.

Solution
One proposed model provides dynamic delegation of rights with capabilities for accounting and access confinement; however, the model still needs to be evaluated for functionality and scalability. Balraj et al. proposed a REST-based approach in which the query string is passed through a URI, and chained delegation can be built by using user agents for the delegation of rights.

2) Privacy

Challenges
For a cloud application that analyzes data, protecting the privacy of the data (and of its provider) is important. The authors of Airavat focus on this goal. They argue that anonymization is not comprehensive, as anonymized data has been used to access confidential information in the past. Mandatory access control (MAC) is effective, but it cannot prevent privacy violations involving the processed output, which could be leaked through malicious software.

Solution
Considering these issues, the authors proposed Airavat, a MapReduce-based solution for protecting the privacy of users. It employs differential privacy, which adds minor noise to the output data without much effect on its quality, thereby protecting the data provider. In addition, Airavat applies mandatory access controls on top of MapReduce to prevent information leakage through system resources.

Billing

Challenges
Incorporating an accurate billing mechanism is significant and challenging for data-intensive cloud systems. With massive requirements to access and compute over huge amounts of data, appropriate methods are needed to compute billing.
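Returning to Airavat's privacy mechanism: differential privacy releases an aggregate only after adding noise drawn from a Laplace distribution whose scale is the query's sensitivity divided by the privacy budget epsilon. The sketch below is illustrative (inverse-CDF Laplace sampling, invented parameter values) and is not Airavat's implementation:

```python
# Sketch of the differential-privacy step Airavat applies: add Laplace
# noise, calibrated to the query's sensitivity and a privacy budget
# epsilon, to an aggregate result before releasing it. Illustrative only.
import math
import random


def laplace_noise(scale):
    # inverse-CDF sampling of the Laplace(0, scale) distribution
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def noisy_count(true_count, sensitivity=1.0, epsilon=0.5):
    # a count query has sensitivity 1: one person changes it by at most 1;
    # smaller epsilon means a stricter privacy budget and more noise
    return true_count + laplace_noise(sensitivity / epsilon)


random.seed(7)
released = noisy_count(100)
```

The noise is small relative to a large aggregate (here, scale 2 around a count of 100), which is why the output quality is largely preserved while any single individual's contribution is masked.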
Solution For a cloud, it is pertinent to have an efficient billing system. A fair billing system for data-intensive computing entails three components: the cost of data storage, the cost of data access, and the cost of computation. Of these, the cost of computation is normally billed in CPU-hours, while storage and access are charged in terms of bytes. The authors argue that charging data access in terms of bytes is inefficient: storage access depends upon a number of factors, including data locality, workload characteristics, inter-workload relationships, transfer size, and bandwidth limitations, and billing on the number of bytes is unfair because it does not reflect these factors. The authors suggest that for storage access, users should instead be billed according to an exertion-based system, such as charging a user according to disk time. Further, inter-workload dependencies should be minimized. It is important to note that in a multi-user system, access to storage may be delayed due to scheduling; in such a case, the above-mentioned billing scheme would be unfair, as the time for disk access may not be deterministic.

3.11 Power Efficiency

Challenges Improving energy efficiency is a major concern for cloud providers. In data-intensive clouds, the complexity of this challenge increases due to several special considerations. For instance, putting unused resources into idle mode is a popular method in traditional clouds; however, data-intensive clouds have latency-sensitive requirements of online analysis and random access, for which idle mode is not useful. Further, data-intensive systems have strong scalability requirements, and for scalable systems power requirements are likely to increase as resources are added. Multi-core systems are also being utilized in clouds, and reducing their power usage is likewise desirable.
Solutions

1) Random Access FAWN is a flash-memory-based system designed to provide low power for data-intensive applications requiring random access. The focus is on large key-value systems in which data is stored in small objects such as images or tweets. Such systems are I/O-intensive, issuing a large number of random data-access requests; for these workloads, disk-based storage provides poor seek performance and requires high power, while DRAM-based storage is expensive and also power-hungry. FAWN is motivated by the need to provide low power for these applications. A FAWN cluster consists of embedded CPUs coupled with flash-based storage: the embedded CPUs reduce the power requirement, whereas the flash-based storage is suitable for random access.

Similarly, Meisner et al. analyzed the latency-power relationship of OLDI (OnLine Data-Intensive) workloads on Google servers. Examples of such systems include search products, online advertising, and machine translation. For such systems, idle mode is not suitable, as it leads to very high latency. Instead, acceptable query latency can be obtained by using a coordinated, full-system, active low-power mode; coordination can balance the workload among servers while maintaining power efficiency.

2) Multi-Core Technology Advancements in multi-core technology have led to its utilization in cloud systems. Shang and Wang proposed power-saving strategies for multi-core data-intensive systems. Such systems have variable CPU and I/O workloads, and for non-uniform loads, existing methods based on switching according to the busy/idle ratio are not very useful. These systems may also have I/O-wait operations which affect job-completion time. The authors suggest that during I/O-wait phases, the CPU frequency can be scaled down without any effect on job-completion time.
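The idea of scaling CPU frequency down during I/O-wait phases can be illustrated with a toy rule-based governor. The thresholds and frequency steps below are illustrative assumptions, not values from the cited work:

```python
# Toy rule-based frequency governor: scale the CPU down while the
# workload is dominated by I/O wait, since a slower core does not
# delay a job that is blocked on disk or network anyway.
# All thresholds and frequency steps are illustrative assumptions.

FREQ_LEVELS_GHZ = [1.2, 1.8, 2.4]  # available frequency steps (hypothetical)

def pick_frequency(iowait_ratio):
    """Return a CPU frequency based on the observed I/O-wait ratio
    (fraction of the sampling interval spent waiting on I/O)."""
    if iowait_ratio > 0.6:       # I/O-bound phase: lowest frequency suffices
        return FREQ_LEVELS_GHZ[0]
    if iowait_ratio > 0.3:       # mixed phase: middle step
        return FREQ_LEVELS_GHZ[1]
    return FREQ_LEVELS_GHZ[2]    # CPU-bound phase: full speed

# Feedback loop over I/O-wait ratios sampled in consecutive intervals.
samples = [0.1, 0.7, 0.4, 0.05]
chosen = [pick_frequency(s) for s in samples]
print(chosen)  # [2.4, 1.2, 1.8, 2.4]
```

A real governor would, of course, derive its decision from hardware counters and the available P-states rather than fixed thresholds.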
MAR (modeless, adaptive, and rule-based) is a power-management scheme based on this method of scaling down frequencies during I/O-wait operations. Using feedback on I/O wait,
the CPU frequency can be controlled. While MAR has shown improved power savings for I/O-wait operations, its effect on data-intensive systems that do not have long I/O-wait phases is likely to be reduced.

3) MapReduce-Based Systems Energy conservation has also been a major focus for MapReduce-based systems. Chen et al. discuss the power consumption of MapReduce-based Interactive Analysis (MIA) systems. For interactive systems, the conventional strategy of increasing hardware utilization is not sufficient, so the authors proposed an energy-efficient MapReduce called BEEMR (Berkeley Energy Efficient MapReduce). Like the conventional MapReduce framework, BEEMR can hold large volumes of data. Interactive jobs are executed on a small pool of dedicated machines which operate at full capacity with their associated storage, whereas less time-sensitive jobs run on the rest of the machines. BEEMR is aided by a workload manager which provides energy-efficient workload management.

Similarly, Lang and Patel evaluated power-saving strategies for MapReduce-based cloud systems, focusing on two categories of techniques for power conservation. In the first approach, CS (Covering Set), a small number of nodes (known as CS nodes) are selected with a high replication factor, such that at least one copy of each unique block is stored on the CS nodes. During periods of low utilization, some or all of the non-CS nodes are powered off in order to conserve power. The second approach, the All-In Strategy (AIS), differs from the CS technique in that all the nodes are operated at full speed in order to complete the task; nodes are switched to idle (low-power) mode only during periods of no utilization. Evaluations reveal that the effectiveness of the two techniques depends on workload complexity and on the time of transition to and from low-power modes.
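A back-of-the-envelope energy model makes the trade-off between the two strategies concrete. The node count and power figures below are hypothetical, and the model deliberately ignores transition overheads:

```python
# Toy energy comparison of Covering Set (CS) vs. All-In Strategy (AIS).
# Node count and per-node power figures are illustrative assumptions;
# transition energy between power states is deliberately ignored.

NODES = 100
P_ACTIVE_W = 300.0  # watts per active node (hypothetical)

def energy_ais(job_hours):
    """All nodes run at full speed until the job finishes (watt-hours)."""
    return NODES * P_ACTIVE_W * job_hours

def energy_cs(job_hours, cs_nodes):
    """Only the covering-set nodes stay on; the rest are powered off.
    For a perfectly linear workload the job takes proportionally longer."""
    stretched_hours = job_hours * NODES / cs_nodes
    return cs_nodes * P_ACTIVE_W * stretched_hours

# Under perfect linearity and zero transition cost the two strategies
# consume identical energy; sub-linear speedup or costly transitions
# are what tip the balance one way or the other.
print(energy_ais(2.0) == energy_cs(2.0, cs_nodes=10))  # True
```

The equality above shows why the comparison hinges on second-order effects: workload linearity and the cost of entering and leaving low-power modes.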
The CS approach is better only for linear workloads and large transition times, whereas the AIS approach is useful in all other cases.

3.12 Network Problems

1) TCP Incast Challenges The use of commodity hardware has been flourishing in data centers. In such an arrangement, low-cost switches with 48 ports and 1 Gbps bandwidth are used at the top of the rack. With low-cost, top-of-rack switches, TCP Incast may occur: when multiple senders communicate with a single receiver over a short period of time and packets from these flows converge at a switch, the switch's buffer may fill and packets may be lost. Such instances are possible in data-intensive cloud computing, as an application may issue time-sensitive queries to servers that are all connected to one switch. TCP Incast can result in low throughput, excessive delay, and poor user experience.

In addition, many data-intensive applications exhibit barrier-synchronized workloads: a client (or an application) queries a number of servers and cannot proceed until it receives responses from all of them. Barrier-synchronized scenarios can encounter the TCP Incast problem, so for many data-intensive applications, such as search engines and recommendation systems, this setup could induce long delays. Diversity in data-intensive cloud computing also implies that traffic requirements are multi-modal; studies suggest that data-center traffic has requirements of low latency, high burst tolerance, and high utilization, and data centers must be equipped to handle all such scenarios.

Solution To solve the TCP Incast problem, Vasudevan et al. proposed that the TCP Retransmission Time-Out (RTO) be reduced. Through real experiments, the authors observed that microsecond timeouts allowed servers to scale up to 47 in a barrier-synchronized communication environment.
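The buffer-overflow mechanism behind incast can be sketched with a simple capacity model. The buffer and burst sizes below are illustrative assumptions, not measurements:

```python
# Toy model of TCP Incast: N barrier-synchronized senders each return a
# burst of packets to the same top-of-rack switch port at once. If the
# combined burst exceeds the port's buffer, the excess is dropped and the
# affected flows stall on retransmission timeouts.
# Buffer and burst sizes are illustrative assumptions.

BUFFER_PKTS = 256   # shared output-port buffer (packets)
BURST_PKTS = 32     # packets each server sends per response

def dropped_packets(n_senders):
    """Packets lost when all senders' bursts arrive simultaneously."""
    offered = n_senders * BURST_PKTS
    return max(0, offered - BUFFER_PKTS)

for n in (4, 8, 16):
    print(n, dropped_packets(n))
# 4 and 8 senders fit in the buffer; 16 senders overflow it by 256 packets
```

Because the barrier forces the client to wait for the slowest flow, even a single timed-out flow delays the whole response, which is why reducing the RTO to microseconds helps.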
2) Incorrect Network Configuration Challenges In many data-intensive cloud systems, Content-Distribution Networks (CDNs) are
used to reduce client latencies by redirecting clients to the nearest server. The authors of WhyHigh examined the Google CDN and observed that redirection does not always provide optimal latency. Using the ping and traceroute utilities for anomaly detection, they found that incorrect routing configuration, lack of peering, and traffic engineering are the main causes of latency inflation.

Solution The authors conclude that improving CDN performance does not always require adding new nodes; it is equally important to use and configure existing nodes effectively. Their solution, WhyHigh, has been in use at Google to improve the performance of the Google CDN.

The above-mentioned contributions emphasize the widespread applicability of data-intensive systems. Further, they assert that, given this wide-scale applicability, application-specific enhancements are pertinent. In the next section, we describe application-level enhancements for data-intensive systems.

4 Application-Specific Solutions for Data-Intensive Systems

In the previous section, we discussed general challenges and solutions related to data-intensive cloud systems that are applicable to a wide variety of applications. In data-intensive computing, challenges and solutions also vary with respect to applications, and there are scenarios where application-specific solutions are developed in order to achieve higher efficiency. For instance, facilitating shared memory can be useful for computing page rank, while efficient utilization of disks can be useful for high-speed sorting. The purpose of this section is to elaborate on solutions that have been proposed to enhance the efficiency of data-intensive systems with respect to specific applications.
Since the cloud has been expanded to incorporate hardware enhancements such as GPUs and single-chip cloud computers, this section also elaborates on hardware enhancements which can be exploited to achieve higher performance. Table 2 presents a summarized view of these enhancements.

1) Processing of Incremental Updates Consider the example of computing web indexes. In this case, the dataset is continuously changing and receiving incremental updates; the data-intensive task of computing indexes should then be executed only on the modified portion of the dataset in order to compute the updated index. Motivated by this need, the authors proposed the Percolator system, which

Table 2 Application-specific solutions for data-intensive systems

S. no.  Issue                                         Solutions
1       Processing of incremental updates             Percolator, Incoop, CBP
2       Stream processing and real-time computation   S4, Storm, Hadoop Online, D-Stream, Meeting user deadlines, Facebook messaging
3       Iterative algorithms                          Twister, HaLoop, Spark, iMapReduce
4       Join operations                               Multi-way joins [62, 74]
5       Dynamic tasks                                 CIEL
6       Shared memory for page ranking                Piccolo
7       Data sampling                                 Lazy MapReduce
8       Searching over encrypted data                 Rank-based keyword search
9       Sorting                                       TritonSort, GPUTeraSort
10      Support for large numbers of files            GIGA+
11      Incorporating hardware enhancements           SCC, Mars, Discmarc, MR-J, using GPU and FPGA, Phoenix++
12      Enhanced scalability for Hadoop               Hadoop NextGen
13      Hybrid approach for transactional             HadoopDB
        and analytical systems
14      MapReduce on different platforms              MapReduce on mobile, MapReduce on Azure
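The incremental-processing idea behind systems such as Percolator can be sketched as follows. This is a minimal illustration, not Percolator's actual API: it simply re-indexes only the documents that changed instead of rebuilding the whole inverted index.

```python
# Minimal sketch of incremental index maintenance: rather than rebuilding
# the inverted index over the entire corpus, only documents whose contents
# changed since the last run are re-indexed. Function names and data
# layout are illustrative, not Percolator's interface.

def index_doc(doc_id, text, inverted):
    """Add postings for one document to the inverted index."""
    for term in set(text.split()):
        inverted.setdefault(term, set()).add(doc_id)

def remove_doc(doc_id, inverted):
    """Drop all stale postings that point at a document."""
    for postings in inverted.values():
        postings.discard(doc_id)

def incremental_update(inverted, changed_docs):
    """Apply only the changed documents to an existing inverted index."""
    for doc_id, text in changed_docs.items():
        remove_doc(doc_id, inverted)       # delete the old postings
        index_doc(doc_id, text, inverted)  # re-index the new contents

# Build the initial index once over the full (tiny) corpus.
inverted = {}
for doc_id, text in {"d1": "cloud data", "d2": "cloud index"}.items():
    index_doc(doc_id, text, inverted)

# Later, only d2 changes; just that document is reprocessed.
incremental_update(inverted, {"d2": "incremental index"})
print(sorted(inverted["index"]))  # ['d2']
print(sorted(inverted["cloud"]))  # ['d1']
```

The saving comes from the update touching work proportional to the size of the change rather than to the size of the corpus, which is the property incremental systems exploit at web scale.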