Trends in Cloud Computing and Big Data
Nikita Bhagat, Ginni Bansal, Dr. Bikrampal Kaur

Abstract - Big data refers to the large amounts of data that are generated and collected in an unstructured way. Big data is described by certain characteristics, often summarized as volume, velocity and variety, and by the technologies needed to address its challenges. Owing to the high volume and velocity of big data, storing it on the cloud is an effective option, because the cloud possesses the capabilities to store big data and to process high volumes of user access requests. Cloud computing is a scalable model for delivering resources and providing services on demand through the network. It offers three service-based models, namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). Clouds can be public, private, hybrid or community. Amazon Web Services and Google App Engine are examples of cloud providers. Clouds are used together with big data to store and exploit unstructured data; with this, we can access the data from anywhere in a cost-effective way. Hadoop is a tool that provides a way to manage big data.

Keywords: Cloud Computing, Big Data, IaaS, PaaS, SaaS, Hadoop.

I. INTRODUCTION
Nowadays online work is increasing in every field, which has led to the development of models based on distributed computing; from this, the concept of cloud computing has emerged. The cloud refers to large pools of data that can be accessed easily from anywhere at any time. This is not an entirely new concept, as the idea dates back to the time-sharing systems of the 1960s; changing technology and business scenarios are the key drivers for the development of this technology. Cloud computing is Internet-based computing in which shared resources and information are provided to computers and other devices on demand.
It reflects the increased adoption of hardware virtualization technologies and service-oriented architecture. Cloud computing has attracted a large number of companies, such as Google and Amazon. Big data, on the other hand, refers to large amounts of data that need to be managed; its characteristics are defined in terms of three V's: volume, velocity and variety.

LITERATURE REVIEW
Cloud computing is an old concept, but over the past decade there has been heightened interest in its adoption. It promises to reshape the provisioning of computing resources in a cost-effective manner. Cloud computing is thus an Internet-based IT service that offers storage and infrastructure as services, and it is often described as "Everything as a Service". In cloud computing, the cloud service provider (CSP) and the cloud service consumer (CSC) play the main roles, although service brokers may also be involved. Cloud computing is therefore a model for enabling convenient, universal, on-demand network access to a shared pool of configurable computing resources (networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Big data refers to datasets that grow so large and become so complex that working with them using traditional data management systems is impractical: they are data sets whose size is beyond the ability of commonly used software tools and storage systems to capture, store and process within a tolerable time. Some of the motivating and driving factors of cloud computing are economies of scale, the trend toward utility computing, advances in technology, and the need for on-demand provisioning of servers and for lowering the cost of entry.

II. CLOUD CLASSIFICATION
Clouds can be classified by service model and by deployment model.
The service model depends on the cloud services being offered and can be classified into Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). The deployment model, on the other hand, depends on how a cloud is set up; it can be classified as private, public, hybrid or community. A private cloud is operated by and for an individual entity. A public cloud is available to the general public, like a utility. When public and private clouds are used together, they form a hybrid cloud, and a community
cloud is set up by and for a group having shared goals.

Figure 1. Logical view of Cloud Computing

A. Infrastructure as a Service (IaaS) provides fundamental computing resources such as processing and storage. Users can deploy and run arbitrary software, e.g. operating systems and applications. The IaaS service model allows limited control of networking components and full control of the operating system, and it is typically enabled via virtualization technologies. Example: Amazon Web Services (AWS) is a leading public IaaS cloud provider, offering services such as computing, storage and databases. Amazon played a key role by initiating efforts to rent out computing resources to external customers and launching AWS as utility computing.

IaaS Cloud Characteristics
1. It provides bare-bones computing infrastructure.
2. The cloud user is responsible for installing all software on the virtual machine.
3. It allows monitoring of resource utilisation and reacting to events.
By far it is the most flexible cloud variant, as the user can configure the virtual machine and the software stack.

B. Platform as a Service (PaaS) is the second service model; it provides a platform on which to deploy applications. In PaaS, the consumer is responsible only for writing the application code, and multiple users can share the platform. Example: Google App Engine (GAE) is a leading public PaaS cloud that offers several services to developers and supports writing apps in different programming languages.

PaaS Cloud Characteristics
1. Allows only provider-supported programming languages, tools, APIs and components for building applications.
2. No control of the underlying infrastructure.
3. Can only control the deployed application and possibly its hosting environment configuration.
4. The effort needed for setup and management is lower than for IaaS.

C. Software as a Service (SaaS) is the environment in which an application developed by a developer is provided to users, who use it on demand. Example: Google Apps such as Gmail and Calendar. VMware's Cloud Foundry offers a range of application development frameworks.

SaaS Cloud Characteristics
1. No control of the underlying infrastructure, such as the network, servers, operating systems and storage.
2. Allows control of a limited set of user-specific application configuration settings.
3. No programming is needed.

CLOUD PROVIDERS
1. Commercial: Amazon EC2 (compute), S3 (storage); Microsoft Azure; Google App Engine
2. Open source: CloudStack

Virtualization in cloud computing means presenting a virtual rather than an actual instance of a machine. It is a form of abstraction that exposes hardware entities as software and partitions the host into logically isolated environments. It is offered via a special software layer placed directly over the hardware, called a hypervisor or virtual machine monitor (VMM), or it can be done via the operating system. Virtualization is mainly done for better resource utilization, to harness hardware capabilities, and for better management.
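The responsibility split between user and provider across the three service models described above can be sketched as a small lookup table. This is an illustrative generalization, not tied to any particular provider; the layer names are the ones commonly used in such comparisons.

```python
# Hypothetical sketch of the user/provider responsibility split under the
# IaaS, PaaS and SaaS models described above. Layer names are illustrative.
SERVICE_MODELS = {
    "IaaS": {"user_managed": ["application", "runtime", "operating system"],
             "provider_managed": ["virtualization", "servers", "storage",
                                  "networking"]},
    "PaaS": {"user_managed": ["application"],
             "provider_managed": ["runtime", "operating system",
                                  "virtualization", "servers", "storage",
                                  "networking"]},
    "SaaS": {"user_managed": [],
             "provider_managed": ["application", "runtime",
                                  "operating system", "virtualization",
                                  "servers", "storage", "networking"]},
}

def who_manages(model: str, layer: str) -> str:
    """Return 'user' or 'provider' for a given layer under a service model."""
    split = SERVICE_MODELS[model]
    return "user" if layer in split["user_managed"] else "provider"
```

For example, `who_manages("IaaS", "operating system")` returns `"user"`, reflecting the IaaS characteristic that the cloud user installs all software on the virtual machine, while under PaaS the same layer is managed by the provider.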
III. BIG DATA
Over the past few years the term 'big data' has been widely used in the information technology industry. Big data was introduced to describe amounts of data so large that traditional data management techniques cannot manage them; problems such as data collection and data storage cannot be solved by traditional technologies. As digitisation increases, enormous amounts of data are being generated; even in a single minute, more data is generated on the Internet than we can easily imagine. Various tools are used to manage this large amount of unstructured data. For activities in an organization such as decision making, this big data needs to be mined to analyze the current data. Large sets of unstructured data, such as production and output records and weather data, require big data analysis to bring order and to reveal trends and patterns.

Big data refers to the technologies and initiatives that involve data too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Big data requires different techniques and tools to solve the data problem; its main aim is to solve these problems in a better way. Big data covers the following types of data:

Traditional enterprise data - includes customer information from CRM systems, transactional ERP data, web store transactions, and general ledger data.
Machine-generated/sensor data - includes Call Detail Records (CDRs), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust), and trading systems data.
Social data - includes customer feedback streams, microblogging sites like Twitter, and social media platforms like Facebook.

Figure 2. Characteristics of Big Data

A. Characteristics
1. Volume - defines the data quantity. Machine-generated data is produced in far larger amounts than non-traditional data. For instance, while a typical PC might once have had 10 gigabytes of storage, Facebook ingests 500 terabytes of new data every day, and an aircraft generates terabytes of flight data during a single flight.
2. Velocity - defines the data speed. The contents of data constantly change because of the absorption of complementary data collections, the introduction of previously archived data, and streamed data arriving from multiple sources. Social media data streams produce a large flow of opinions and relationships. Even at the fixed number of characters per tweet, the high velocity (or frequency) of Twitter data ensures large volumes (over 8 TB per day).
3. Variety - defines the data type. It refers to the different types of data collected via sensors, smartphones or social networks. Such data types include video, image, text, audio, and data logs, in either structured or unstructured format, and most of the data generated is unstructured. As more services are added and new sensors are deployed, new data types are needed to capture the resultant information.

IV. HADOOP AND MAPREDUCE
Hadoop is one of the most widely used platforms for the development of big data applications. It is an open-source, Java-based software framework that provides a set of algorithms for the distributed processing of data sets, giving applications both data motion and reliability. Hadoop was derived from the Google File System and Google's MapReduce [4]. The MapReduce paradigm is implemented using Hadoop.

MapReduce Programming Model - Hadoop is an open-source distributed platform that is supported by Yahoo and used by Amazon and a number of other companies. To achieve parallel execution, Hadoop implements the MapReduce model. This
programming model is implemented by many other cloud platforms as well. MapReduce is a distributed divide-and-conquer programming model that consists of two phases: a massively parallel Map phase, followed by an aggregating Reduce phase. The input data of MapReduce is broken down into a list of key/value pairs. Mappers, the processes assigned to the Map phase, accept the incoming pairs, process them in parallel, and generate intermediate key/value pairs. MapReduce is used to process large data sets; the problem has to be data-parallelizable, and the model leverages locality of data to avoid transmission overheads. In the Map step, the large input dataset is split into smaller ones that are distributed to worker nodes, and the worker nodes report their results to the master node; in the Reduce step, the master node combines the received results to generate the output. Apache Hadoop also handles several low-level cluster management tasks.

Relationship with Cloud - Technically, there is no dependency between MapReduce and the cloud; MapReduce implementations work well in both cloud and non-cloud environments. However, MapReduce can immensely benefit from the elasticity of cloud resources, and dynamic scaling is easier on the cloud.

V. BIG DATA APPLICATIONS
Big data applications are large-scale distributed applications that work with large data sets. Data is critical in the healthcare industry, where it documents the history and evolution of a patient's illness and care, giving healthcare providers the tools they need to make informed treatment decisions. With medical image archives growing by 20 to 40 percent annually, by 2015 an average hospital was expected to generate terabytes of data. If large sets of medical data were routinely collected and electronic health records were filled with high-resolution images, outcomes could be better predicted, improving efficiency and quality.
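The Map, shuffle and Reduce phases described in Section IV can be illustrated with the classic word-count problem. The sketch below is a minimal single-process simulation of the control flow, assuming nothing beyond the model itself; real Hadoop distributes these phases across worker nodes and handles the shuffle internally.

```python
from collections import defaultdict

# Minimal single-process sketch of the MapReduce flow: map emits
# intermediate key/value pairs, a shuffle groups them by key, and reduce
# aggregates each group. Function names are illustrative.

def map_phase(documents):
    """Mapper: emit (word, 1) for every word in every input document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data on the cloud", "cloud computing and big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] == 2, counts["cloud"] == 2, counts["computing"] == 1
```

Because each mapper works on its own split and each reducer on its own key group, both phases parallelize naturally, which is exactly why the model suits cluster execution.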
Data analysis is difficult in big data, but frameworks such as Google's MapReduce and Apache Hadoop are used to solve big data problems. Big data provides an infrastructure for transparency in the manufacturing industry, which resolves several uncertainties. In these big data applications, the conceptual framework begins with data acquisition, where different types of sensory data are acquired; the combination of sensory data and historical data constitutes the big data in manufacturing. The big data generated from this combination acts as input to predictive tools and preventive strategies such as health management.

Analyses in the healthcare domain are multifaceted. Remote patient monitoring, an emerging market segment of machine-to-machine (M2M) communications, is proving to be a source of useful information; people with diabetes, for instance, are at risk of long-term complications such as kidney disease.

Big data is well suited to public services because it is based on mass analysis, for example of public transport, avoiding issues with privacy and the use of personal data. Better decisions on public transport can be made and justified by evidence, improving the efficiency of the service as well as transparency, choice and accountability. At the same time, big data stands in stark contrast to data avoidance and data minimization, the two basic principles of data protection: it facilitates the tracking of people's movements, behaviors and preferences, and in turn helps to predict an individual's behavior with unprecedented accuracy, often without the individual's consent. For instance, electronic health records and real-time self-quantification may constitute an enormous step forward in streamlining the prescription of drugs or diet and fitness plans. Big data also differs from traditional data in many respects: much of it is automatically generated by machines, such as sensors embedded in an engine; it is not designed to be user-friendly; and it is an entirely new source of data.
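The remote patient monitoring scenario above reduces, at its simplest, to checking a stream of sensor readings against a threshold and flagging out-of-range values. The sketch below is purely illustrative: the glucose threshold and the readings are hypothetical values chosen for the example, not clinical guidance.

```python
# Illustrative sketch of remote patient monitoring: a stream of sensor
# readings is checked against a threshold and out-of-range values are
# flagged. The 180 mg/dL threshold is a hypothetical example value.

GLUCOSE_ALERT_MGDL = 180

def alerts(readings, threshold=GLUCOSE_ALERT_MGDL):
    """Return (index, value) for each reading above the threshold."""
    return [(i, v) for i, v in enumerate(readings) if v > threshold]

stream = [110, 145, 196, 130, 210]
flagged = alerts(stream)  # readings at positions 2 and 4 exceed the threshold
```

In a real M2M deployment the stream would arrive continuously from a device rather than from a list, but the per-reading check is the same.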
Different sources of big data can be users, applications, systems and sensors.

A. Major Open Problems
Big data is becoming an invisible gold mine because of the potential value it contains, but several major open problems must be addressed to ensure the success of data management systems in the cloud. Different systems target different aspects of the design space, and multiple open problems remain. With respect to key-value stores, although these systems are popular, they support only very simple functionality; providing support for ad-hoc querying on top of a key-value store, or providing consistency guarantees at different access granularities, remains an open problem. In the domain of relational database management, an important open problem is how to make systems elastic so that they effectively utilize the available resources and minimize the cost of operation. Characterizing the different consistency semantics that can be provided at different scales, and finding effective techniques for load balancing, are also critical aspects of such systems. Designing scalable, elastic and autonomic multitenant database systems is another important challenge that must be addressed.

The big data problems:
1. Speed - various import and export problems. Traditional data management schemes use centralized storage; when importing and exporting large data from centralized storage, performance declines.
2. Type and structure - many earlier models have fixed patterns. With rapid development, data keeps growing and its format is not fixed, so different data-processing models must be used and integrated over structured and unstructured data whose type, source and structure differ.
3. Cost - there is a cost difference between mainframes and PC servers; hardware devices are very expensive, and cost differences often occur.
4. Security and privacy - concerns both structured and unstructured data.
For structured data there are many mechanisms for storage and security. As the amount of data increases, centralized storage and processing are shifting to distributed processing. Security here means preventing data loss, which requires backup and redundancy mechanisms so that data is never lost; it also means protecting data from unauthorized access. For large amounts of data, unified security access control mechanisms must be constructed, and privacy problems are thus closely associated with big data.
5. Data sharing - defines data standards and interfaces.

VI. BIG DATA AND CLOUD
Since big data is too large to process locally and takes too long to transfer, cloud computing provides a solution by making the data sets available in the public cloud, where they can be accessed from anywhere in a cost-efficient way. Big data environments require clusters of servers to support the tools that process the large volumes, high velocity and varied formats of big data, and most enterprises are looking to cloud computing as the structure to support their big data projects. Being able to analyze the data from anywhere makes big data in the cloud more appealing in terms of cost and time, and data storage using cloud computing is a more feasible option for small to medium-sized organisations; it provides an environment in which to implement big data technology.

Cloud computing and big data are correlated with each other. Big data provides the ability to process distributed queries across multiple data sets and return results in a timely manner, while cloud computing provides the underlying engine, for example through the use of Hadoop. Big data utilizes distributed storage technology based on cloud computing rather than local storage attached to a computer or electronic device. Big data evolution is driven by fast-growing cloud-based applications developed using virtualized technologies. Therefore, cloud computing not only provides facilities for the computation and processing of big data but also serves as a service model. The cloud computing infrastructure can serve as an effective platform to address the data storage required to perform big data analysis; cloud computing is correlated with a new pattern for provisioning computing infrastructure and a big data processing method for all types of resources available in the cloud through data analysis.
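The "distributed queries across multiple data sets" mentioned above usually follow a scatter-gather pattern: the same query runs independently against each data partition and the partial results are merged. The sketch below simulates this with plain lists standing in for storage nodes; the partition contents and function names are illustrative assumptions.

```python
# Sketch of the scatter-gather pattern behind distributed queries: the same
# query runs on every partition, then partial results are merged. Lists
# stand in for storage nodes; all data and names are illustrative.

partitions = [
    [{"item": "a", "qty": 3}, {"item": "b", "qty": 5}],   # node 1
    [{"item": "a", "qty": 2}],                            # node 2
    [{"item": "b", "qty": 1}, {"item": "c", "qty": 4}],   # node 3
]

def scatter(query, parts):
    """Run the query independently on each partition (parallel in practice)."""
    return [query(p) for p in parts]

def gather(partials):
    """Merge per-partition totals into one result."""
    merged = {}
    for partial in partials:
        for item, qty in partial.items():
            merged[item] = merged.get(item, 0) + qty
    return merged

def qty_by_item(partition):
    """The per-partition query: total quantity per item."""
    totals = {}
    for row in partition:
        totals[row["item"]] = totals.get(row["item"], 0) + row["qty"]
    return totals

result = gather(scatter(qty_by_item, partitions))
# result == {"a": 5, "b": 6, "c": 4}
```

The scatter step is where the cloud's elasticity pays off: because each partition is queried independently, adding nodes directly increases parallelism, which is the timeliness argument made above.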
Several cloud-based technologies are needed for this environment because dealing with big data has become complicated; for example, MapReduce is used for data processing in cloud environments. Big data needs large on-demand compute power and distributed storage to crunch the 3V data problem, and the cloud seamlessly provides the elastic, on-demand compute required for it. The cloud has established the "as-a-Service" model by hiding the complexity and challenges involved in building a scalable, elastic application, and the same is required for big data processing: Hadoop hides the complexity of large-scale distributed processing in the same way. This simplification is the prime reason for the mass adoption of big data and the cloud, and the combination of the two can increase the adoption of solutions to the complex problem of large-scale distributed processing.

A. Drivers for big data on cloud adoption
1. Cost reduction: Cloud computing offers a cost-effective way to support big data technologies and the advanced analytics applications that can drive business value. Organizations are constantly looking to analyse hidden data and patterns, and big data environments require clusters of servers to support varied tools; cloud computing can therefore serve as the structure to save costs through the cloud's pay-per-use model.
2. Reduced overhead: Any big data solution requires many components and integrations. With cloud computing, these components can be automated, which reduces complexity and improves the productivity of a team.
3. Rapid provisioning/time to market: Provisioning servers in the cloud is as easy as buying something on the Internet, and big data environments can be scaled up or down easily based on processing requirements. Faster provisioning is important for big data applications because the value of data diminishes quickly as time goes by.
4. Flexibility/scalability: Big data analysis, say in the life sciences industry, requires huge computing power for a brief amount of time, and for this type of analysis servers need to be provisioned in minutes. This kind of scalability and flexibility can be achieved in the cloud, replacing huge investments in supercomputers with simply paying for computing on an hourly basis.

Figure 3. Cloud Computing usage in Big Data

Both cloud and big data are about delivering value to the enterprise by lowering cost; the cloud implements this through the pay-per-use model. Big data and the cloud have been driving down costs for the enterprise, and their common goal is to bring value to it. At the same time, cloud and big data bring data security and privacy
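The scalability driver above is ultimately an arithmetic argument: paying hourly for a short burst of capacity versus buying the hardware outright. A back-of-the-envelope sketch, in which all prices, server counts and durations are hypothetical example figures:

```python
# Back-of-the-envelope sketch of pay-per-use versus upfront purchase.
# All prices and durations below are hypothetical example figures.

def cloud_cost(servers, hours, rate_per_server_hour):
    """Pay-per-use: cost of renting capacity for the duration of a burst."""
    return servers * hours * rate_per_server_hour

def owned_cost(servers, price_per_server):
    """Upfront: capital cost of buying the same number of servers."""
    return servers * price_per_server

# A burst analysis: 100 servers for 48 hours at a hypothetical $0.50/hour,
# versus buying 100 servers at a hypothetical $3000 each.
burst = cloud_cost(100, 48, 0.50)    # 2400.0
upfront = owned_cost(100, 3000.00)   # 300000.0
```

For a brief, bursty workload the rental figure is orders of magnitude lower, which is the whole case for cloud-hosted big data analysis; the comparison reverses only when the cluster is kept busy around the clock for long periods.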
concerns. Cloud and big data are used together within organisations to build elastic, scalable private cloud solutions. Big data has co-occurred with the rapid adoption of PaaS and IaaS technologies: IaaS allows the rapid deployment of computation nodes, while PaaS lets firms scale their capacity on demand and reduce costs. The cloud empowers big data processing for enterprises of all sizes by reducing the number of problems, although some complexity remains in collecting the data from large unstructured sources. Cloud computing "democratizes" big data: any enterprise can now work with unstructured data at a huge scale. By bringing big data analysis to the cloud, users gain affordable, flexible, on-demand access to large amounts of computing resources. The cloud reduces overall production costs and provides enterprises with cost-effective, flexible access to big data's enormous magnitudes of information, while big data on the cloud draws on vast amounts of on-demand computing resources for best-practice analytics. Both technologies will continue to evolve and converge in the future.

VII. CONCLUSION
Cloud computing is a technology being used on a large scale by industries to capture potential opportunities, while on the other hand large amounts of unstructured data exist and need to be managed; together, cloud computing and big data address data processing. With big data, analysts have not only more data to work with, but also the processing power to handle large numbers of records with many attributes. Traditional machine learning uses statistical analysis based on a sample of a total data set; the ability to process very large numbers of records, with very large numbers of attributes per record, increases predictability. The combination of big data and compute power also lets analysts explore new behavioral data throughout the day, such as websites visited or location. Such data used to be called sparse data, because a lot of it had to be searched to find something of interest. The combination provides cost reduction and easy availability of data.

REFERENCES
[1] Hogan, M., Liu, F., Sokol, A., Tong, J., NIST Cloud Computing Standards Roadmap, 2011, /customcf/get_pdf.cfm?pub_id=
[2] Foster, I., Zhao, Y., Lu, S., Cloud computing and Grid computing 360-degree compared, Proc. Grid Computing Environments Workshop (GCE '08).
[3] Kubick, W.R., Big Data, Information and Meaning, Clinical Trial Insights, 2012.
[4] Dean, J., Ghemawat, S., MapReduce: simplified data processing on large clusters, Communications of the ACM, Vol. 51, Issue 1, New York, USA, ACM.
[5] Strauch, S., Kopp, O., Leymann, F., Unger, T., A Taxonomy for Cloud Data Hosting Solutions, Dependable, Autonomic and Secure Computing (DASC), Sydney.
[6] Prodan, R., Sperk, M., Scientific computing with Google App Engine, Future Generation Computer Systems, Elsevier.
[7] Shvachko, K., Hairong, K., Radia, S., Chansler, R., The Hadoop distributed file system, Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10.
[8] Lohr, S., "The Age of Big Data", New York Times, Feb 11, 2012, www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html
[9] Lynch, M., "Barack Obama's Big Data won the US election", Computerworld, Nov 13, 2012, /article/Barack_Obama_39_s_Big_Data_won_the_US_election
[10] Waldspurger, C.A., Memory resource management in VMware ESX server, ACM SIGOPS Operating Systems Review - OSDI '02, New York, USA, Vol. 36, Issue SI.