Big Data Analytics with Hadoop on Cloud for Masses
Rupali Sathe, Srijita Bhattacharjee
Department of Computer Engineering, Pillai HOC College of Engineering and Technology, Rasayani

ABSTRACT
Businesses have greatly benefited from data analytics. Companies analyze activities such as sales, marketing, fraud and risk management, inventory optimization, and customer support to improve their strategic and tactical business decisions. However, analytics powerful enough to work with big data has been too expensive, complex, and compute- and resource-intensive for smaller companies to use. Now, though, vendors and service providers are leveraging advanced data-processing, network, and cloud technologies to make big analytics suitable for such organizations. Market research firm IDC said the big data market is growing about 40 percent annually, with revenue expected to rise from $3.2 billion in 2010 to $17.2 billion in 2015. Hadoop serves as the Big Data storage platform. The idea arises from the high cost of Big Data analytics and storage platforms: how can they be made available to small businesses, even though such businesses do not have nearly the scope of operations found in bigger organizations? Deployed in the cloud as an on-demand service, Hadoop provides a single solution even when terabytes of data are involved. Companies such as Amazon (through EC2) and Rackspace offer cloud-based deployment facilities for Hadoop. They allow direct uploads or direct connections from a client website; in our case, clickstream data can be uploaded or posted directly from the website's web server.

Index Terms: Hadoop, Clickstream

I. INTRODUCTION
Businesses have greatly benefited from data analytics. Companies analyze activities such as sales, marketing, fraud and risk management, inventory optimization, and customer support to improve their strategic and tactical business decisions. However, analytics powerful enough to work with big data has been too expensive, complex, and compute- and resource-intensive for smaller companies to use. Now, though, vendors and service providers are leveraging advanced data-processing, network, and cloud technologies to make big analytics suitable for such organizations. Some vendors provide analytics software, while others offer services via the cloud. The stakes are enormous: market research firm IDC said the big data market is growing about 40 percent annually, with revenue expected to rise from $3.2 billion in 2010 to $17.2 billion in 2015. This paper introduces Big Data (what it is, why it matters, and how it is handled), the challenges of big data storage and faster real-time analytics, and Hadoop as the Big Data storage platform. The idea arises from the high cost of Big Data analytics and storage platforms: how can they be made available to small businesses, even though such businesses do not have nearly the scope of operations found in bigger organizations? Deployed in the cloud as an on-demand service, Hadoop provides a single solution even when terabytes of data are involved. Companies such as Amazon (through EC2) and Rackspace offer cloud-based deployment facilities for Hadoop. They allow direct uploads or direct connections from a client website; in our case, clickstream data can be uploaded or posted directly from the website's web server. Cloud-based analytics either accepts uploaded data from a company or connects directly to its website or back-end service. Smaller companies haven't used the cloud-based approach for big analytics until now, in part because the cloud-based services marketplace is relatively new.
It hasn't always been easy either for smaller companies to transmit large amounts of data over networks or for service providers to return analytics quickly. Technical advances, such as faster networks and processors, have helped with this.

II. BACKGROUND OVERVIEW
Small and mid-sized organizations could benefit from analytics as much as large ones. Such companies are struggling with data volumes growing 62 percent annually, according to a market research firm. However, these businesses haven't been able to benefit from high-powered analytics and therefore can't make the most of their information. Taking advantage of big data often involves a progression of cultural and technical changes throughout the business, from exploring new business opportunities to expanding the sphere of inquiry to exploiting new insights as traditional and big data analytics are merged. The journey often begins with traditional enterprise data and tools, which yield insights about everything from sales forecasts to inventory levels. The data typically resides in a data warehouse and is analyzed with SQL-based business intelligence (BI) tools. Much of the data in the warehouse comes from business transactions originally captured in an OLTP database. These planning and forecasting applications can benefit from big data, but organizations need advanced analytics to make this goal a reality. For more advanced data analysis such as statistical analysis, data mining, predictive analytics, and text mining, companies have traditionally moved the data to dedicated servers for analysis. Exporting the data out of the data warehouse, creating copies of it on external analytical servers, and deriving insights and predictions is time consuming. It also requires duplicate data storage environments and specialized data analysis skills. Once you've successfully built a predictive model, using that model with production data involves either complex rewriting of the model or the additional movement of large volumes of data from a data warehouse to an external data analysis server.

III. APPLICATION AREAS
We have recently crossed a threshold beyond which mobile technology and social media are generating datasets larger than humans can comprehend. According to the IDC Digital Universe study, the growth of data will never stop. In particular, the following market areas are providing the bulk of the emerging applications. Figure 1 shows where to expect most application activity:
1) Social Media
2) Sensor Networks
3) Commercial Industries
4) Mobile Technology
Today, we're surrounded by data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Web log files track the movement of visitors to a website, revealing who clicked where and when. This data can reveal how people interact with your site.
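As a concrete illustration of the kind of clickstream record a web log contains, the following is a minimal Java sketch that parses one access-log line in the common Apache "combined" format into structured fields. The sample line and field handling are illustrative assumptions; a production loader would deal with malformed entries more defensively.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal sketch: parse one Apache "combined" access-log line into clickstream fields. */
public class ClickstreamLogParser {

    // Groups: ip, identd, user, timestamp, method, path, protocol, status, bytes, referrer, user agent.
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    public static void main(String[] args) {
        // Hypothetical sample record; real clickstream lines come from the web server's access log.
        String line = "203.0.113.7 - - [10/Oct/2015:13:55:36 +0530] "
                + "\"GET /products/item42 HTTP/1.1\" 200 2326 "
                + "\"http://www.example.com/index.html\" \"Mozilla/5.0\"";

        Matcher m = LOG_PATTERN.matcher(line);
        if (m.matches()) {
            System.out.println("visitor IP : " + m.group(1));
            System.out.println("timestamp  : " + m.group(4));
            System.out.println("page       : " + m.group(6));   // which page was clicked
            System.out.println("status     : " + m.group(8));
            System.out.println("referrer   : " + m.group(10));  // where the visitor came from
        } else {
            System.err.println("Unparseable log line: " + line); // skip or quarantine in practice
        }
    }
}

Records like these, extracted field by field, are what the later sections load into the Hadoop cluster for analysis.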
Social media helps us understand what people are thinking or how they feel about something. Big data can be derived from web pages, social media sites, tweets, blog entries, email exchanges, search indexes, clickstreams, equipment sensors, and all types of multimedia files, including audio, video, and photographs. This data can be collected not only from computers, but also from billions of mobile phones, tens of billions of social media posts, and an ever-expanding array of networked sensors from cars, utility meters, shipping containers, shop floor equipment, point-of-sale terminals, and many other sources. Most of this data is sparse and information-poor, and doesn't fit immediately into your data warehouse.
Fig 1. Big Data Scenario

IV. WHY HADOOP ON CLOUD?
Depending on your data processing needs, your Hadoop workload can vary widely over time. You may have a few large data processing jobs that occasionally take advantage of hundreds of nodes, but those same nodes will sit idle the rest of the time. You may be new to Hadoop and want to get familiar with it before investing in a dedicated cluster. You may own a startup that needs to conserve cash and wants to avoid the capital expense of a Hadoop cluster. In these and other situations, it makes more sense to rent a cluster of machines than to buy one. The general paradigm of provisioning compute resources as a remote service in a flexible, cost-effective manner is called cloud computing. The best-known infrastructure service platform for cloud computing is Amazon Web Services (AWS). You can rent computing and storage services from AWS on demand as your requirements scale. As of this writing, renting a computational unit with the equivalent power of a 1.0 GHz 32-bit Opteron with 1.7 GB RAM and 160 GB of disk storage costs $0.10 per hour. Using a cluster of 100 such machines for an hour therefore costs a measly $10. It wasn't too long ago that only a select few had access to clusters on this scale. Thanks to AWS and other such services, large-scale compute power is available to many people today. Because of its flexibility and cost effectiveness, running Hadoop on the AWS cloud is a popular setup.
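To make the rent-versus-buy reasoning concrete, here is a small Java sketch that plugs in the $0.10-per-instance-hour figure quoted above. The number of nodes, the busy hours per month, and the cost of an owned cluster are hypothetical values chosen only to illustrate the comparison.

/** Minimal sketch: compare on-demand cloud cost against an owned cluster, using illustrative numbers. */
public class ClusterCostEstimate {

    public static void main(String[] args) {
        double ratePerNodeHour = 0.10;   // AWS on-demand rate quoted in the text ($ per node per hour)
        int nodes = 100;                 // hypothetical cluster size
        double jobHoursPerMonth = 20;    // hypothetical: the cluster is busy only 20 hours per month

        double cloudCostPerMonth = ratePerNodeHour * nodes * jobHoursPerMonth;

        // Hypothetical owned cluster: purchase price amortized over 36 months, plus monthly operations.
        double purchasePrice = 150_000;
        double opsPerMonth = 1_000;
        double ownedCostPerMonth = purchasePrice / 36 + opsPerMonth;

        System.out.printf("Cloud (rented)   : $%.2f per month%n", cloudCostPerMonth);
        System.out.printf("Owned (amortized): $%.2f per month%n", ownedCostPerMonth);
    }
}

With these illustrative numbers the rented cluster costs about $200 per month against several thousand for an owned one, which is exactly the occasional-workload scenario described above.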
V. THE PROPOSED SYSTEM
A. System Overview
The proposed system is divided into the following sub-modules or sections.

Fig 2. System Block Diagram

B. Block Diagram
The block diagram of the system is shown in Fig 2.

C. Explanations of Blocks
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: A distributed file system that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.
Oozie: A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).

VI. PROJECT DEVELOPMENT METHODOLOGY OR STEPS
The following are the development steps taken to achieve a working prototype of the proposed system.
Project technical details:
Operating System: Linux
Technology Platform: Hadoop 2.0 or 1.0, Cloud (AWS/Rackspace), Hue, Pig, Hive
Languages: Pig Latin, Java
Backend: HBase
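Since Hive and Hue appear in the technology stack above as the query layer over data stored in HDFS, one way an analytics front end could issue clickstream queries from Java is through Hive's standard JDBC driver (HiveServer2). The sketch below is only illustrative: the host name, credentials, and the clickstream table and its columns are assumptions, not part of the paper's actual deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Minimal sketch: run a clickstream aggregation through Hive's JDBC interface (HiveServer2). */
public class HiveClickstreamQuery {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // register the Hive JDBC driver (hive-jdbc on the classpath)

        // Hypothetical HiveServer2 endpoint on the cloud-hosted cluster.
        String url = "jdbc:hive2://hadoop-master.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // The "clickstream" table and its columns are assumed to exist; Hive translates this SQL to MapReduce.
            String sql = "SELECT page, COUNT(*) AS hits "
                       + "FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}

Hue provides the same kind of querying interactively from the browser, which is why it is listed alongside Hive in the platform above.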
VII. WORKING FLOW OF THE SYSTEM
The following are the major steps in the working flow of our proposed system:
1. Set up the Hadoop development environment
2. Hadoop cluster development
3. Installation of Pig on the Hadoop cluster
4. Configure the Hadoop cluster with Pig
5. Cloud virtual image development
6. Upload data to the Hadoop cluster
7. Upload the Hadoop cluster to the cloud
8. Install and configure Hue
9. Access data from the Hadoop cluster on the cloud from an external website
10. Perform clickstream analytics on the Big Data

Fig 3. System Deployment Diagram

VIII. COMPONENTS OF THE SYSTEM
A. HDFS
HDFS has many similarities with other distributed file systems, but is different in several respects. One noticeable difference is HDFS's write-once-read-many model, which relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access. Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space. HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written. HDFS has many goals; one of the most notable is fault tolerance by detecting faults and applying quick, automatic recovery.
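As an illustration of the write-once model just described, and of step 6 above (uploading data to the Hadoop cluster), the following is a minimal sketch that copies a local clickstream log into HDFS using the standard Hadoop FileSystem API. The NameNode address and both paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal sketch: upload a local clickstream log into HDFS with the Hadoop FileSystem API. */
public class UploadClickstreamToHdfs {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address of the cloud-hosted cluster.
        conf.set("fs.defaultFS", "hdfs://hadoop-master.example.com:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path localLog = new Path("/var/log/httpd/access_log");   // clickstream data from the web server
            Path hdfsDir  = new Path("/user/hadoop/clickstream/");   // destination directory in HDFS

            // Each upload creates a new file; HDFS allows one writer at a time and append-only streams.
            fs.copyFromLocalFile(localLog, hdfsDir);
            System.out.println("Uploaded " + localLog + " to " + hdfsDir);
        }
    }
}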
B. MapReduce
MapReduce is the second base component of the Hadoop framework. It manages the data processing part of the Hadoop cluster. A cluster runs MapReduce jobs written in Java or in other languages (Python, Ruby, R, etc.) via streaming. A single job may consist of multiple tasks. The number of tasks that can run on a data node is governed by the amount of memory installed, so it is recommended to install more memory on each data node for better cluster performance. Depending on the application workload that will run on the cluster, a data node may require high compute power, high disk I/O, or additional memory. MapReduce programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. The model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily: the communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale. Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If a mapping task changes an input (key, value) pair, the change is not reflected back in the input files; communication occurs only by generating new output (key, value) pairs, which are then forwarded by the Hadoop system into the next phase of execution.

Hadoop Cluster Modes
Standalone (or Local) Mode: No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
Pseudo-Distributed Mode: Hadoop daemons run on the local machine.
Fully-Distributed Mode: Hadoop daemons run on a cluster of machines.
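To show what the immutable (key, value) flow described above looks like in code, here is a hedged sketch of a MapReduce job written against the standard org.apache.hadoop.mapreduce API, counting hits per page from clickstream records. It assumes each input line is a tab-separated record whose second field is the requested page; that format, the class names, and the paths are illustrative assumptions rather than the paper's exact data layout.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Minimal sketch: count hits per page from tab-separated clickstream records (timestamp \t page \t ...). */
public class PageHitCount {

    public static class HitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length > 1) {          // skip malformed records
                page.set(fields[1]);          // second field assumed to hold the requested page
                context.write(page, ONE);     // emit a new (page, 1) pair; inputs are never modified
            }
        }
    }

    public static class HitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text page, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(page, new IntWritable(sum));   // total hits for this page
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page hit count");
        job.setJarByClass(PageHitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setCombinerClass(HitReducer.class);
        job.setReducerClass(HitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/hadoop/clickstream/
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In standalone mode the same job runs in a single JVM for development, and unchanged on the fully distributed, cloud-hosted cluster for production-scale data.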
C. Results
We analyze two aspects of the result. A Hadoop cluster is a set of commodity machines networked together in one location; data storage and processing all occur within this cloud of machines. Different users can submit computing jobs to Hadoop from individual clients, which can be their own desktop machines in locations remote from the Hadoop cluster. Here we have set up a Hadoop cluster in standalone mode, in which everything runs in a single JVM. This approach replaces the scale-up of huge monolithic servers with scale-out across commodity machines, lowering the cost per unit of performance compared with current technology. Before looking at the results, it is important to understand a few points. Statistical integrity was not achieved in this work because of the cost of running the jobs. However, given the stability and consistency of the cloud infrastructure and how the tests were performed, there is significant value in the findings.

IX. CONCLUSION
The primary goal of big data analytics is to help companies make better business decisions by enabling users to analyze huge volumes of transaction data as well as other data sources, such as clickstream data, that may be left untapped by conventional business intelligence programs. Handling big data requires a lot of storage and good networks, and analytical techniques such as machine learning use considerable compute power. However, the cloud will help make big analytics possible for smaller businesses by letting service providers handle the analytics. Customers can use today's inexpensive storage to collect large amounts of data, which they can then send to providers via widely available high-bandwidth networks. Smaller organizations that want to run analytics software themselves can also do so on more powerful servers than in the past because of technical advances in lower-cost multicore processors. However, cloud-based services have become the more popular route because they are scalable and relatively inexpensive, and they don't require a company to have employees with data-analytics skills.