What Drives Big Data Analytics To Cloud?

Easwar Krishna Iyer*, Sachin Sood, Neha Gupta & Tapan Panda**
* Great Lakes Institute of Management, Chennai, India.
** Great Lakes Institute of Management, Chennai, India.

Abstract

Data has been anointed as the Oil of the new millennium. From sourcing to storage, new-age data follows a complex path, a journey that often starts in the amorphous wisdom-of-the-crowd social media and ends in the nebulous cloud space. The key transformational platform for data between sourcing and storage is data analytics. Analysis of data can be achieved either using traditional stand-alone computers or by moving data analytics to the cloud. Given the volume, velocity and variety of data, knowledge extraction from new-age data has a complexity which one would not have associated with data processing a decade back. This paper posits that Big Data Analytics will slowly migrate towards Cloud Computing platforms and proceeds to find out the drivers that would accelerate this migration. The study aims at finding out the most significant drivers from the proposed list, from the analytics user perspective.

Keywords: Big Data Analytics, Cloud Computing, Migration Drivers, Business Analytics, Factor Analysis

Introduction

Data has been anointed as the Oil of the new millennium. Literature has gone to the extent of elevating data to the status of the fourth factor of modern production after land, labor and capital. Data processing and storage is getting aggregated today in the modern-day temples called data centers, whose scale and complexity are mounting by the day. Based on ownership vs. utilization patterns, these data centers can be classified as captive and third-party data centers. Starting from sourcing and ending
International Journal of Consumer & Business Analytics

with storage, new-age data follows a complex path, a journey that often starts in the amorphous wisdom-of-the-crowd social media and ends in the nebulous cloud computing space. Here are a few data snippets about data. 90% of the data in the world today was created in the last few years. The world will have over 40 times more data in 2020 than what it has today. Inanimate objects will start creating more data than human beings, thanks to the emergence of the Internet of Things. Some of the largest generators and aggregators of data did not exist a decade back. Facebook was launched in 2004. Tweets were born, thanks to Twitter, in 2006. The first version of the iPhone was launched in 2007. The iPad hit the market in 2010. In between, cloud, as a technology, topped the Gartner hype cycle in 2009. New units for data like zettabytes (10^6 x 10^6 x 10^9 bytes, or a million million gigabytes) have been coined only very recently. The world touched its first zettabyte of data in 2010 and is poised to cross 30 zettabytes by 2020. Digitization of the world, miniaturization of technology, the rapid fall in hardware pricing due to economies-of-scale production, the availability of free space in the cloud for data storage, the tendency to create and upload personal information, and the rapid advancement in the processing intelligence of computing machines: all these have contributed to the explosion of data. Let us now move from data to big data. Big data is not just large and voluminous data. A finer understanding of big data is possible when one maps data along three independent vectors: volume, velocity and variety. The volume or scale aspect of data was briefly touched upon in the previous paragraph. To quote an example that will highlight the relevance of data volumes, the digital data collected by Wal-Mart every hour is approximated to be the equivalent of 20 million filing cabinets' worth of traditional text data.
Obviously, the old way of data handling wouldn't work anymore. Coming to velocity, information and its utilization today are on a real-time basis. From financial markets to retail to supply chain to traffic management, real-time analysis of data drives the critical mission objective. Despite the high relevance of volume and velocity, it is variety that makes big data and its analytics a completely different proposition. Data comes from hybrid sources like social sites, sensors, tweets, posts, blogs, GPS systems, smart phones, portals, online shopping patterns, credit card usage patterns and countless other sources. Today, we have location-aware data, person-aware data, context-aware data and the like. In the context of understanding data better, one can add two more Vs to the three Vs (volume, velocity and variety) already mentioned above: veracity of data and visualness of data. Veracity indicates the challenges in data interpretation, given the probabilistic nature
of data and the uncertainty that goes with data deconstruction. Visualness is more of a presentation-side aspect and indicates the richness of the dashboards through which the final data can be presented. Two things become evident at this point. The first is that the amount of information that companies are dealing with today is growing at an exponential rate. Handling today's data in terms of throughput, analytics and storage would require massive processing power and huge storage-cum-retrieval systems. This would entail a high capital outlay. The second point is that any meaningful insights from this huge jungle of data can be achieved only by sophisticated data analytics. With data getting more and more unstructured, the tools required to process it are getting more sophisticated. Thus, over and above capital, the operating cost outlay in terms of manpower and other overheads is also going to get formidable. This is where the role of Cloud Computing comes in. The cloud computing environment, where analytics can be procured as a service, offers a cost-effective alternative from the point of both capital and operating costs. The Cloud platform transforms analytics into a utility, much like electricity, that can be accessed by anyone on a use-and-pay basis. The National Institute of Standards and Technology [NIST] defines cloud computing as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This paper posits that the distributed, on-demand, self-service, location-independent, elastic, pay-for-use-only, zero-CAPEX, zero-ownership, utility-driven cloud model will be the most effective platform for handling Big Data Analytics.
Before proceeding to an in-depth literature survey, a peep into a few recent studies will help in putting big data in perspective. A study conducted on global business leaders by The Economist reveals that nine out of ten of them consider data to be the fourth factor of production after the traditional land, labor and capital. In an increasingly virtualized corporate world, land will soon become a non-player. Productivity innovations coupled with strategic tools like outsourcing and offshoring have changed the dynamics of labor. This realignment of factors will pitch data, on par with capital, as the heart and soul of modern business. A Bloomberg study indicates that 97% of companies with revenues exceeding $100 million are using some form or the other of big data analytics.
Literature Survey

Big data and its analytics is a fairly broad and complex subject. So is Cloud Computing, as an enabler for the analytics delivery. Existing literature covers drivers that point to Cloud being the next-generation option for gaining insights from and analyzing data. A ringside overview of these drivers is provided in this section. Data storage and its associated costs in cloud are discussed [Todd Weller, 2013] in an analyst interview with Wall Street. Demand for hosted storage and its implications is dealt with [Clive Longbottom, 2012] in his insightful article on how to make sense of the Big Data universe. Another article [Christina Tamer, 2013] discusses the flexibility that cloud platforms offer to instantly scale up and scale down operations depending upon the load. This dynamic load variability is not possible with traditional on-premise systems. The said paper also talks about the high cost of security. Easwar et al. discuss and model the implications of both zero CAPEX as well as low OPEX for any cloud operation. Both of these aspects are cloud-related hygiene factors and hence are applicable in the context of big data migration also. In a path-breaking Big Data article in HBR, [Andrew McAfee, 2012] brings out the importance of speed / velocity in data processing. The same paper also talks about data aggregation possibilities where different streams of incoming data get aggregated over various hardware and software platforms. Hsinchun Chen et al. bring in the relevance of scalability in the context of data mining and information extraction. Another paper [Michael Ohata, 2012] drives home the importance of high-end analytical tools to cull out intelligence from data. The paper also advocates the relevance of visual aids like richer dashboards to bring out all the dimensionalities of data.
Many of the points mentioned above are touched upon [Mary Ann M. Gobble, 2013] in a summing-up paper on big data, which she cites as the next big thing in innovation. Her primal thrust is on scale and the complexity that is necessitated by the sheer scale of data. Technology, existing as well as legacy in character, and its upgradation is one of the themes proposed by Kreg Nichols et al. in their work "Getting Ahead in the Cloud". Irrespective of the nature of deployment, choosing the service model and deciding on the cloud adoption mix [SaaS vs. IaaS vs. PaaS] will involve an understanding of technology upgradation. Jeanne E. Johnson talks about the convergence of the big data as well as big analytics spaces for creating a big business opportunity. Marc Walterbusch et al. reinforce the total cost of ownership perspective by creating a mathematical model for the calculation of the TCO of cloud computing services. Martin Courtney sees the whole big data platform as a big jigsaw that needs to be set together. The paper avers that building connectors between
databases, data warehouses, processing platforms, application software and BI tools needs a modular approach. Richard Baillie brings in the important element of green computing and energy conservation in the context of creating modern-day data centers, which are significant energy consumers. Carrying the green concept forward, Easwar et al. describe a model of a data center ecosystem whose energy consumption patterns have been studied as a function of incremental cloud adoption. Lorraine S. Lee et al. discuss managing cost implications in the context of Cloud adoption. The paper clearly brings out the potential benefits that can be accrued by reducing the in-house IT resource costs when things start moving to the cloud. Luminita Hurbean et al. bring in the synergy element between social networks, mobility, analytics and Cloud. Some commercial literature has described this convergence of technology with the acronym SMAC. The paper in mention talks about the value generation that lies at the cusp of these very different yet seamlessly integratable platforms. The entire flow of big data lies in its availability in social, accessibility via mobility, insight management using analytics and finally storage in the cloud. Douglas Eadline, in an e-article on HPC, talks about vast computing resources with high computing performance becoming available in the cloud space. HPC was originally in the realm of private clouds only. The paper talks of outstanding-performance products coming to the public cloud space also. Shaoshan Liu et al. describe the two extremes in today's market: extremely high-throughput machines at one end and low-power, low-speed mobile platforms at the other end. The paper talks about the necessity to create a heterogeneous architecture which blends HTC with all systems.
To sum up, there is an existing body of literature available today which throws light on the various drivers that trigger the migration of big data analytics into cloud. This paper aggregates these drivers and studies the emerging patterns using factor analysis and regression.

Description of Variables

Before getting into factor analysis and then following it up with regression, a brief contextual description of the eighteen variables used in the study would be in order. Each of these variables is positioned as a driver which will accelerate the migration of Big Data Analytics into Cloud.
Storage Space

Storage space in cloud can be broadly divided into consumer-level storage space and enterprise-level storage space. The former still comes as a free or a freemium model. Dropbox, Google Drive, SkyDrive etc. are examples of such freemium models. In the context of Big Data Analytics, which is an enterprise-level construct, storage and its possible configurations become a key cost savings enabler.

Load Variability

The computational load requirement of a firm will never be flat. Peaks and troughs of activity are a given for most firms. From intraday peaking (stock markets) to seasonal peaking (holiday reservations), different activities will have different peak cycles. By dynamic provisioning and cloud bursting (bursting into a public cloud when the internal capacity gets fully utilized), cloud platforms provide load variability as an intrinsic feature.

No CAPEX

Cloud relieves firms from initial CAPEX investments, be it for hardware or for expensive software. Its pay-as-you-use model transitions cloud buying from a CAPEX platform to a deferred OPEX platform. It is this CAPEX vs. OPEX tradeoff that changes Cloud from a technology offering to a business offering. Hence, the decision paradigm to adopt cloud is more often compelling economics rather than superior technology.

Low OPEX

The low OPEX of cloud is related to the low prices at which cloud bundles are available today. This looks more like a pricing strategy at this point rather than any inherent intrinsic feature of the cloud (unlike CAPEX). Cloud penetration, the world over, is only just looking up, and to gain a critical mass of adoption, cloud majors are probably aiming at a rock-bottom pricing strategy. This penetration pricing of cloud offerings ought to continue till cloud reaches a significant critical mass.
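The CAPEX-to-OPEX tradeoff described above can be made concrete with a simple breakeven sketch. Every figure below is a hypothetical assumption for illustration, not a number from this study.

```python
# Hypothetical CAPEX-vs-OPEX breakeven sketch. All figures are
# illustrative assumptions, not data from this study.
def breakeven_months(capex, onprem_opex_pm, cloud_opex_pm):
    """Months until cumulative cloud pay-as-you-use spend catches up
    with the avoided up-front CAPEX of an on-premise buy."""
    premium_pm = cloud_opex_pm - onprem_opex_pm  # extra monthly cost of renting
    if premium_pm <= 0:
        return 0  # cloud is cheaper from day one
    return capex / premium_pm

# e.g. $120,000 of avoided up-front hardware vs. a cloud bundle at
# $4,000/month against $1,500/month of on-premise running cost
print(breakeven_months(120_000, 1_500, 4_000))  # 48.0 months
```

On these assumed numbers, ownership breaks even only after four years of usage, which is the "compelling economics" the section refers to.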
Speed of Data Processing

Given the myriad nature of data and the high velocity with which it is varying, speed of processing becomes a critical success factor for any player who wants to analyze large amounts of data on a real-time basis. In the context of cloud, speed takes on a new meaning given the inherent network latency effect of cloud. Any big data
analysis on cloud should factor in the latency aspect if real-time output is a requirement.

Scale of Data

With IT infrastructure growing in tandem with business growth, scaling up IT is no longer an option; it is a business imperative. With scale bringing in its own associated complexities, the game of scaling up has gone beyond mere buying of IT infrastructure and stringing it together. Cloud, by its very construct, offers a quick change of scale. What is more, though up-scaling is viable for both traditional systems and cloud systems, down-scaling is possible only in cloud.

Technology Upgradation

In the context of a product buy, the onus of technology and its regular upgradation is on the buyer. The game changes when buying gets switched from product to service. Now, the onus of technology management, technology upgradation, obsolescence management and the like moves from the buyer to the seller. With technology integration becoming more and more complicated, cloud offers a promise of technology upgradation and upkeep from the vendor side. The user can now afford to be technology agnostic and utilization focused.

Access to Visual Aids

Data by itself is of no tangible use unless it is converted to information. Information again has only limited use unless it is displayed in a meaningful manner. Visual aids and dashboards are powerful MIS tools used at the last mile of analytics engines to create a visual impact of data. A good dashboard is a key facilitator for management to review, query and analyze data.

High End Analytical Tools

If visual aids represent the visible front end of data outputting, analytical tools represent the invisible back end of data processing. Best-of-class business analytics tool providers have already started creating cloud versions of their offerings. The cost tradeoffs of using such high-end tools in traditional vs. cloud scenarios are only emerging.
Resource Cost Reduction

There is enough empirical evidence to suggest that with higher cloud adoption, the costs of overall IT resources (manpower cost to run the IT show) will come down.
The fall in IT resource cost is not going to be a natural outcome of cloud adoption. It has to be a strategically driven one. Costs like training costs, IT staff salary costs, supervisory staff salary costs, hiring costs and the like can be brought down in steps when firms move more into the cloud space.

Cost of Security

Data security, data confidentiality and data privacy are the biggest impediments in cloud adoption today. From multi-tenancy risk to accidental disclosure risk to SLA inadequacy risk to phishing risk, cloud represents multiple facets of risk. Existing literature quotes data security fears, real as well as perceived, as the biggest single factor that could inhibit cloud adoption.

Green Computing

The ICT industry is one of the single largest CO2-emitting industries of the world. Greening the computing environment goes beyond the confines of a mere environmental issue. With cloud adoption, higher utilization will ensure lower energy consumption and hence better greening.

Modularized Procurement

Cloud offerings have a heterogeneity built into them. From applications to infrastructure to software to storage to platforms, cloud can be packaged and procured in multiple modules. This modularity offers flexibility at one end while introducing an element of complication at the other as to which is the most effective modular combination that one can procure.

Reduced TCO

For startup firms, CAPEX-constrained firms and cash-strapped firms, the cost perspective of cloud adoption has to be from the total cost perspective. From procurement to maintenance to manpower to depreciation to the cost of security breaches, cloud has a host of cost drivers that will eventually determine the overall total cost of operations. Understanding TCO and modeling its behavior for different levels of cloud adoption becomes a key business imperative.
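The idea of modeling TCO behavior for different levels of cloud adoption can be sketched as follows. The cost coefficients below are purely illustrative assumptions, not figures from the study or from any vendor.

```python
# Illustrative-only TCO sketch: total cost as a function of the
# fraction `a` of the workload moved to cloud (0 = fully on-premise,
# 1 = fully cloud). Every coefficient is a made-up assumption.
def total_cost_of_ownership(a, capex=100.0, onprem_opex=40.0,
                            cloud_usage=55.0, security_overhead=10.0):
    onprem_share = (1 - a) * (capex + onprem_opex)       # owned: buy + maintain
    cloud_share = a * (cloud_usage + security_overhead)  # rented: usage + security
    return onprem_share + cloud_share

# TCO profile across adoption levels 0.0, 0.1, ..., 1.0
costs = {a / 10: total_cost_of_ownership(a / 10) for a in range(11)}
```

With these assumed coefficients the TCO falls monotonically as adoption rises; with different assumptions (for example, a steep security overhead) the curve could just as well turn upward, which is why modeling it per firm matters.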
High Throughput Computing (HTC)

HTC indicates the requirement of significant computing resources for long periods of activity. Sustainability, robustness, throughput capacity and reliability are indicators of a good HTC system. HTC is more of an efficiency indicator.
High Performance Computing (HPC)

HPC, on the other hand, is an indicator of large amounts of computing power required for short periods of time. HPC is more of an effectiveness indicator.

High Data Aggregation

If speed indicates velocity and scale indicates volume, then data aggregation indicates the sheer variety of data that is required to be processed today. The heterogeneity of data, as indicated by variety, presents a vector more complicated than either volume or velocity.

SMAC

SMAC brings in the integration between four completely different platforms: Social, Mobile, Analytics and Cloud. The four platforms indicate where data is present, how it can be accessed, how / where it can be analyzed and finally how it can be stored.

Methodology

Using an extensive literature survey, the authors have narrowed down on eighteen variables that could eventually drive the migration of Big Data into the Cloud space. Their individual behavior is explained in the previous section. The study aims at finding out the most significant drivers from this posited list, as seen from the Big Data Analytics user perspective. The methodology adopted is quantitative; factor analysis, followed by multivariate linear regression, is used for data analysis. A five-point Likert scale has been used, with the option of a neutral / undecided stand in between. IBM SPSS Statistics 20.0 is used for data analysis. The survey was administered in an online format. The respondents are people currently working in the analytics space and conversant with the cloud environment. The study has not tried to discriminate between analytics users in the traditional space vs. analytics users who have recently moved some part of their job to cloud. Veteran analytics users on cloud platforms will not be there in the response set, primarily because the combined possibility of analytics plus cloud is very new to the Indian market.
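The screening applied to such Likert-scale responses (dropping incomplete answer sets and "straight-liners" who mark the same rank for every question) can be sketched as below. The DataFrame and its column names (v1..v18) are illustrative stand-ins, not the study's actual survey data.

```python
import pandas as pd

# Sketch of survey-response screening: drop incomplete responses and
# respondents whose per-row standard deviation across all items is 0.0
# (i.e., the same rank scale marked for every question).
def screen_responses(df):
    df = df.dropna()                    # eliminate incomplete responses
    per_respondent_sd = df.std(axis=1)  # same rank everywhere -> sd = 0.0
    return df[per_respondent_sd > 0.0]

# Four illustrative respondents answering eighteen 1-5 Likert items
raw = pd.DataFrame([[(i + j) % 5 + 1 for j in range(18)] for i in range(4)],
                   columns=[f"v{n}" for n in range(1, 19)])
raw.loc[1] = 5                    # respondent 1 marked 5 for every item
raw.loc[2, "v7"] = float("nan")   # respondent 2 skipped an item
clean = screen_responses(raw)     # respondents 0 and 3 survive
```

The same two rules, applied to the real response pool, left the 100 usable responses the study works with.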
All respondents are from India-based analytics companies. Given the respondent mix, the response that has been mapped is an ex-ante response of potential cloud adopters. The answers do not reflect an ex-post opinion of a current user of Big Data in Cloud. A set of 100 responses was finally studied, after eliminating those that were either incomplete or had the same rank scale marked for all questions (standard deviation
for their individual responses was 0.0). Before regression was done, a correlation matrix was generated for the eighteen independent variables to check possibilities of interdependence. The matrix shows multicollinearity, and hence factor analysis was done before the final regression. Five factors emerge with eigenvalues greater than 1.0. The Kaiser-Meyer-Olkin measure of sampling adequacy and the approximate Chi-square value of Bartlett's Test of Sphericity (both reported in Table 02) are adequate and indicate the appropriateness of going for factor analysis. The factors have been extracted using Principal Component Analysis. The total cumulative variance explained by the five emerged factors is 59.87%. Varimax rotation with Kaiser normalization has been applied to ensure that no variable loads onto more than a single factor. All the eighteen variables with which the study started have got loaded into one factor or the other, indicating the relevance of the starting variables. As will be seen in Table 03, the attributes that add up to become one factor are randomly spread out within the questionnaire. As an example, attributes 1, 3, 4, 10 and 14 load into the first factor. This proves that the factors that have emerged are not due to any priming at the questionnaire stage and there is no implicit memory effect in the question sequencing. The factor scores obtained for the five factors are added to the regression variable table. Regression is now performed using only the five emerged factors to find out the significant factors that drive the Big Data to Cloud migration. The F value of the ANOVA test and its significance level indicate the mathematical robustness of the results. The R2 value obtained is low, indicating that this is a very nascent area of exploratory research. The purpose of the study was not to build in predictive elements.
Hence, at an exploratory level, the results obtained are indicative of the emerging patterns in this new and nebulous market. The analysis and insights of our results are dealt with in the next section.

Analysis and Interpretation

Cost, Technology, Speed, Additional Investment and Value Adds emerge as the five independent vectors along which the respondent population thinks when it comes to Big Data Analytics migrating into Cloud platforms. This is the most important output of this study. The mapping of the eighteen variables (attributes) into the five emerged factors is shown in Table 01. As can be seen, five attributes add up to give factor #1, four each for factor #2 and factor #3, three for factor #4 and the last two attributes for factor #5.
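The extraction-and-regression pipeline the study describes (Principal Component Analysis, eigenvalue-greater-than-1 retention, varimax rotation, then regression on factor scores) can be sketched as follows. Synthetic random data stands in for the 100 survey responses; the study itself used IBM SPSS Statistics 20.0, and nothing below reproduces its actual numbers.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation of a factor-loading matrix (standard algorithm)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        rot = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rot ** 3
                          - (gamma / p) * rot @ np.diag((rot ** 2).sum(axis=0))))
        rotation = u @ vt
        new_var = s.sum()
        if new_var < var * (1 + tol):
            break
        var = new_var
    return loadings @ rotation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 18))            # stand-in for 100 responses x 18 items
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize before PCA

corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 1.0                      # Kaiser criterion: eigenvalue > 1
loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
rotated = varimax(loadings)               # sharpen which items load on which factor

scores = Z @ np.linalg.pinv(rotated.T)    # crude factor-score estimate
y = rng.normal(size=100)                  # stand-in for the dependent variable
beta, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(100), scores]), y, rcond=None)
```

Because varimax is an orthogonal rotation, each variable's communality (the row sum of squared loadings) is unchanged; the rotation only redistributes the loading pattern so interpretation, as in Table 03, becomes cleaner.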
TABLE 1 ATTRIBUTE TO FACTOR MAPPING OF WHAT DRIVES BIG DATA TO CLOUD

Multiple cost elements that integrate to an aggregate Cost Saving emerge as the most significant factor that the respondent market chooses for a possible analytics migration to cloud. The attributes that add up are storage cost, resource cost, CAPEX nullification, OPEX reduction and reduced overall TCO. With data getting complex and its processing getting expensive, data analysts are clearly looking at the horizon for a low-computation-cost platform. In theoretical terms, a computational problem can be viewed as a series of instances, with each instance requiring a different solution. Respondents feel that the low-cost promise of the cloud can possibly aid in bringing diverse computational solutions under one seamless roof. Cost is driven by scale. The economies of scale that an aggregate cloud player can bring to the market can never be replicated by individual players. From predictive analytics to prescriptive analytics, the types of analytic requirements are very large. Within this spectrum, there are multiple domains like marketing, forecasting, risk, financial services, supply chain and the like, to name a few. Every analytic space
wants a reliable, cost-effective solution for its data management. This requirement is getting manifest in one simple word: COST. Load variability, technology upgradation, high performance computing (HPC) and SMAC get added together in the respondent's mind to form the second factor, which we have named Technology. The market has tended to add together elements that give a technology edge when thinking of cloud. Up- and down-scaling of actual load, achieving high computational performance in short bursts of high-power computing, meeting cutting-edge technology upgradation and integrating social, mobile, analytics and cloud at the technology level: these are the variables that have added together to give the factor TECHNOLOGY. Neither low cost nor high technology gives a feel of the speed of processing. The market has aggregated sheer speed, scale, volume, throughput and data aggregation into the third factor, SPEED. This closely correlates with the velocity vector of Big Data management.

TABLE 2 KMO AND BARTLETT'S TEST

The market senses that a few attributes will actually add on to cost in terms of extra investment. Security and Green are two clear candidates that will require an additional investment outlay. The market has added the attribute modularized procurement also into this basket. Any modular arrangement of buying will tend to create missing elements in the buying list. The procurement of those missing elements will entail an additional investment. Hence these three add up to give the fourth factor, ADDITIONAL INVESTMENT. Analytics-specific value adds like high-end analytical tools and high-end visual aids add up to complete the set of five factors. We have given factor #5 the nomenclature VALUE ADDS. Table 02 sums up the KMO and Bartlett's test coefficients of our study.
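For reference, the two adequacy statistics reported in Table 02 are conventionally defined as follows; this is standard textbook notation, not taken from the paper itself.

```latex
% Kaiser-Meyer-Olkin measure of sampling adequacy, where r_{ij} are the
% simple correlations and u_{ij} the partial correlations among items:
\mathrm{KMO} \;=\;
  \frac{\sum_{i \neq j} r_{ij}^{2}}
       {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} u_{ij}^{2}}

% Bartlett's test of sphericity for n respondents and p variables, with
% R the correlation matrix (degrees of freedom: p(p-1)/2):
\chi^{2} \;=\; -\Bigl(n - 1 - \frac{2p + 5}{6}\Bigr)\,\ln \lvert R \rvert
```

A KMO value approaching 1 and a Bartlett Chi-square large enough to reject the identity-matrix hypothesis together justify proceeding with factor analysis.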
TABLE 3 ROTATED COMPONENT MATRIX

The rotated component matrix, which has been used for accurate factor delineation, is shown in Table 03. Table 04 gives the output of the regression analysis that was done on the five emerged factors. As seen from the data, only two factors emerge as significant. These are
Factor 01 (Cost) and Factor 03 (Speed). All the emerged factors have been mapped on a perception map, with Ownership Focus to Utilization Focus mapped along the x axis and investment requirement mapped along the y axis. The perception map is shown in Figure 01.

TABLE 4 REGRESSION COEFFICIENTS [ONLY THE FACTORS ARE USED AS VARIABLES]

FIGURE 1 MAPPING OF FACTORS ON AN OWNERSHIP / UTILIZATION VS INVESTMENT PLOT
If one reads the two exhibits together [Table 04 and Figure 01], the inference is that the market segregates features into ownership-centric features and utilization-centric features. High-end technology, additional investments and value adds are more the fixation of ownership-centric buyers. Not for a moment do we want to dilute the importance of these vectors in the generic scheme of things. What we are trying to emphasize is that when the analytics game moves towards the cloud, the perspective of the respondent changes from ownership elements to utilization elements. The focus shifts to basic hygiene factors like cost and speed. One can perceive a similar analogy in the context of auto buying vs. auto leasing. Fuel injection technology, engine knocking reduction, high-end music systems, retractable roofs and the like would be the driving factors for auto buyers. In a sense, they represent value adds which call for a higher investment. But the auto leasing customer would simply focus on cost and speed of transportation. The vehicle ceases to matter; the need that it serves matters.

Conclusion and Future Direction

As a technology, Cloud Computing topped the Gartner hype cycle only as recently as 2009. Critical adoption of cloud is still a few years away, and markets are waiting and watching for more referrals. In this context, the authors of this paper feel that an understanding of the significant drivers of Big Data Analytics migration to Cloud will go a long way in positioning the right elements of Cloud to the Big Data market. Big data is here to stay, and so is Cloud. The sale scenario is a pure B2B sale scenario.
A finer understanding of the analytics consumer mindset will help cloud vendors in creating the right brochures, mailers, ads in sector-specific magazines and other promotional campaigns that will help the nascent cloud technology gain sufficient market traction in the emerging world of consumer and business data analytics. Another way to look at the whole evolving game is that the cloud industry, given its nascent state, is more focused today on creating the right technology back end. Scalability, encryption, compatibility, migration, data audit, data center location, scale and the like are what keep the industry preoccupied today. This paper invests attention on the buyer and not on the product. It shifts the game from the technology back end to the business front end. This paper tries to look at the crossover of analytics and cloud. Some of the authors of this paper are currently working on the crossover of mobility and cloud. There are interesting business models that emerge at the cusp of cloud computing and mobility platforms. Simultaneous work is also on in finding out what drives disruptive innovation in the context of cloud.