Big Data Analytics with Hadoop on Cloud for Masses

Rupali Sathe, Srijita Bhattacharjee
Department of Computer Engineering, Pillai HOC College of Engineering and Technology, Rasayani

A B S T R A C T

Businesses have greatly benefited from data analytics. Companies analyze activities such as sales, marketing, fraud and risk management, inventory optimization, and customer support to improve their strategic and tactical business decisions. However, analytics powerful enough to work with big data has been too expensive, complex, and compute- and resource-intensive for smaller companies to use. Now, though, vendors and service providers are leveraging advanced data-processing, network, and cloud technologies to make big analytics suitable for use by such organizations. Market research firm IDC said the big data market is growing about 40 percent annually, with revenue expected to rise from $3.2 billion in 2010 to $17.2 billion in 2015. This paper considers Hadoop as a Big Data storage platform. The idea arises from the expense of Big Data analytics and storage platforms: how can they be made available to small businesses that don't have nearly the scope of operations found in bigger organizations? When Hadoop is deployed in the cloud as an on-demand service, it provides one solution for working with terabytes of data. Providers such as Amazon (through EC2) and Rackspace offer cloud-based deployment of Hadoop. They allow direct upload or a direct connection from a client website; in our case, clickstream data can be uploaded or posted directly from the website's web server.

Index Terms: Hadoop, Clickstream

I. INTRODUCTION

Businesses have greatly benefited from data analytics. Companies analyze activities such as sales, marketing, fraud and risk management, inventory optimization, and customer support to improve their strategic and tactical business decisions. However, analytics powerful enough to work with big data has been too expensive, complex, and compute- and resource-intensive for smaller companies to use. Now, though, vendors and service providers are leveraging advanced data-processing, network, and cloud technologies to make big analytics suitable for use by such organizations. Some vendors provide analytics software, while others offer services via the cloud. The stakes are enormous: market research firm IDC said the big data market is growing about 40 percent annually, with revenue expected to rise from $3.2 billion in 2010 to $17.2 billion in 2015.

This paper introduces Big Data (the what, why, and how of it) and the challenges of big data storage and faster real-time analytics, and considers Hadoop as a Big Data storage platform. The idea arises from the expense of Big Data analytics and storage platforms: how can they be made available to small businesses that don't have nearly the scope of operations found in bigger organizations? When Hadoop is deployed in the cloud as an on-demand service, it provides one solution for working with terabytes of data. Providers such as Amazon (through EC2) and Rackspace offer cloud-based deployment of Hadoop. They allow direct upload or a direct connection from a client website; in our case, clickstream data can be uploaded or posted directly from the website's web server. Cloud-based analytics either accepts uploaded data from a company or connects directly to its website or back-end service. Smaller companies haven't used the cloud-based approach for big analytics until now, in part because the cloud-based services marketplace is relatively new.
It hasn't always been easy either for smaller companies to transmit large amounts of data over networks or for service providers to return analytics quickly. Technical advances, such as faster networks and processors, have helped with this.

II. BACKGROUND OVERVIEW

Small and mid-sized organizations could benefit from analytics as much as large ones. Such companies are struggling with data volumes growing 62 percent annually, according to one market research firm. However, these businesses haven't been able to benefit from high-powered analytics and therefore can't make the most of their information.

Taking advantage of big data often involves a progression of cultural and technical changes throughout a business, from exploring new business opportunities, to expanding the sphere of inquiry, to exploiting new insights as traditional and big data analytics are merged. The journey often begins with traditional enterprise data and tools, which yield insights about everything from sales forecasts to inventory levels. The data typically resides in a data warehouse and is analyzed with SQL-based business intelligence (BI) tools. Much of the data in the warehouse comes from business transactions originally captured in an OLTP database. These planning and forecasting applications can benefit from big data, but organizations need advanced analytics to make this goal a reality. For more advanced data analysis, such as statistical analysis, data mining, predictive analytics, and text mining, companies have traditionally moved the data to dedicated servers. Exporting the data out of the data warehouse, creating copies of it in external analytical servers, and deriving insights and predictions is time-consuming. It also requires duplicate data storage environments and specialized data analysis skills. Once you've successfully built a predictive model, using that model with production data involves either complex rewriting of the model or the additional movement of large volumes of data from the data warehouse to an external data analysis server.

III. APPLICATION AREAS

We have recently crossed a threshold beyond which mobile technology and social media are generating datasets larger than humans can comprehend. According to the IDC Digital Universe study, the growth of data will never stop. In particular, the following market areas are providing the bulk of the emerging applications; Fig. 1 shows where to expect most application activity:
1) Social media
2) Sensor networks
3) Commercial industries
4) Mobile technology

Today, we're surrounded by data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Web log files track the movement of visitors to a website, revealing who clicked where and when; this data can reveal how people interact with your site. Social media helps in understanding what people are thinking or how they feel about something. Such data can be derived from web pages, social media sites, tweets, blog entries, email exchanges, search indexes, click streams, equipment sensors, and all types of multimedia files, including audio, video, and photographs.
This data can be collected not only from computers but also from billions of mobile phones, tens of billions of social media posts, and an ever-expanding array of networked sensors in cars, utility meters, shipping containers, shop-floor equipment, point-of-sale terminals, and many other sources. Most of this data is low-density and information-poor, and doesn't fit immediately into a data warehouse.
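To make the web-log idea concrete, the short Java sketch below parses one web-server log line into the who, where, and when fields described above. It assumes the Apache common log format; the sample line, class name, and field layout are illustrative assumptions, not details taken from the paper's dataset.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative parser for one common-log-format line, e.g.:
    // 127.0.0.1 - alice [10/Oct/2015:13:55:36 -0700] "GET /products/item1 HTTP/1.1" 200 2326
    public class ClickstreamLineParser {
        private static final Pattern LOG_LINE = Pattern.compile(
                "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"\\S+ (\\S+) [^\"]*\" (\\d{3}) \\S+");

        public static void main(String[] args) {
            String line = "127.0.0.1 - alice [10/Oct/2015:13:55:36 -0700] "
                    + "\"GET /products/item1 HTTP/1.1\" 200 2326";
            Matcher m = LOG_LINE.matcher(line);
            if (m.find()) {
                System.out.println("visitor = " + m.group(1)); // who clicked (client IP)
                System.out.println("user    = " + m.group(2)); // authenticated user, if any
                System.out.println("when    = " + m.group(3)); // timestamp of the click
                System.out.println("page    = " + m.group(4)); // where they clicked
                System.out.println("status  = " + m.group(5)); // HTTP response code
            }
        }
    }

Each parsed line becomes one clickstream record; at scale, this same extraction step runs inside a distributed job rather than on a single machine.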

Fig 1. Big Data Scenario

IV. WHY HADOOP ON THE CLOUD?

Depending on your data processing needs, your Hadoop workload can vary widely over time. You may have a few large data processing jobs that occasionally take advantage of hundreds of nodes, but those same nodes will sit idle the rest of the time. You may be new to Hadoop and want to get familiar with it before investing in a dedicated cluster. You may own a startup that needs to conserve cash and wants to avoid the capital expense of a Hadoop cluster. In these and other situations, it makes more sense to rent a cluster of machines than to buy one. The general paradigm of provisioning compute resources as a remote service in a flexible, cost-effective manner is called cloud computing. The best-known infrastructure service platform for cloud computing is Amazon Web Services (AWS). You can rent computing and storage services from AWS on demand as your requirements scale. At the time of writing, renting a computational unit with the equivalent power of a 1.0 GHz 32-bit Opteron with 1.7 GB RAM and 160 GB disk storage cost $0.10 per hour, so using a cluster of 100 such machines for an hour costs a measly $10. It wasn't too long ago that only a select few had access to clusters on this scale. Thanks to AWS and other such services, large-scale compute power is available to many people today. Because of its flexibility and cost-effectiveness, running Hadoop on the AWS cloud is a popular setup.
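As a quick check on the rental arithmetic above, a tiny cost model in Java (the $0.10 rate is the historical figure quoted in the text, not a current cloud price):

    public class ClusterCostEstimate {
        public static void main(String[] args) {
            double pricePerNodeHour = 0.10; // historical per-node rate quoted above, in dollars
            int nodes = 100;
            double hours = 1.0;
            double cost = pricePerNodeHour * nodes * hours;
            System.out.printf("Renting %d nodes for %.1f hour(s): $%.2f%n", nodes, hours, cost); // $10.00
        }
    }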

V. THE PROPOSED SYSTEM

A. System Overview
The proposed system is divided into the following sub-modules or sections.

Fig 2. System Block Diagram

B. Block Diagram
The block diagram of the system is shown in Fig. 2 above.

C. Explanations of Blocks
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: A distributed file system that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (translated by the runtime engine to MapReduce jobs) for querying the data.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives, such as distributed locks, that can be used for building distributed applications.
Sqoop: A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.
Oozie: A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).
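As one example of how these blocks work together, the sketch below queries Hive from Java over JDBC, the SQL-on-Hadoop access just described. It is a sketch only: the HiveServer2 URL, the clicks table, and its columns are hypothetical placeholders, and it assumes the standard Hive JDBC driver is on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveClickstreamQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint and database; adjust for a real cluster.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement();
                 // Hive translates this SQL into MapReduce jobs over data stored in HDFS.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }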

VI. PROJECT DEVELOPMENT METHODOLOGY OR STEPS

The following are the development steps needed to achieve a working prototype of the proposed system.

Project technical details:
Operating System: Linux
Technology Platform: Hadoop 2.0 or 1.0, Cloud (AWS/Rackspace), Hue, Pig, Hive
Languages: Pig Latin, Java
Backend: HBase

VII. WORKING FLOW OF THE SYSTEM

The working flow of the proposed system consists of the following major steps:
1. Set up the Hadoop development environment
2. Develop the Hadoop cluster
3. Install Pig on the Hadoop cluster
4. Configure the Hadoop cluster with Pig
5. Develop the cloud virtual image
6. Upload data to the Hadoop cluster
7. Upload the Hadoop cluster to the cloud
8. Install and configure Hue
9. Access data from the Hadoop cluster on the cloud from an external website
10. Perform clickstream analytics on the Big Data

Fig 3. System Deployment Diagram

VIII. COMPONENTS OF THE SYSTEM

A. HDFS
HDFS has many similarities with other distributed file systems, but it differs in several respects. One noticeable difference is HDFS's write-once-read-many model, which relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access. Another distinctive attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space. HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written. Among HDFS's most notable goals is fault tolerance: detecting faults and applying quick, automatic recovery.
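To illustrate the write-once model, here is a minimal sketch using Hadoop's standard Java FileSystem API. The NameNode URI and file path are placeholders: a single writer creates the file, bytes go in append order, and in-place updates are not supported.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: one writer creates a file once; readers later stream it back in written order.
    public class HdfsWriteOnceDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder NameNode URI

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/clickstream/2015-01-01.log"); // hypothetical path
            try (FSDataOutputStream out = fs.create(path)) {     // one writer at a time
                out.write("127.0.0.1 /index.html\n".getBytes(StandardCharsets.UTF_8));
            }
            // The file cannot be rewritten in place; reopen it only to read or append.
            fs.close();
        }
    }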

B. MapReduce
MapReduce is the second base component of the Hadoop framework, and it manages the data processing part of the cluster. A cluster runs MapReduce jobs written in Java, or in other languages (Python, Ruby, R, etc.) via streaming. A single job may consist of multiple tasks. The number of tasks that can run on a data node is governed by the amount of memory installed, so more memory on each data node generally gives better cluster performance. Depending on the application workload that will run on the cluster, a data node may require high compute power, high disk I/O, or additional memory.

MapReduce programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. The model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily: the communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale. Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If a mapping task changes an input (key, value) pair, the change is not reflected back in the input files; communication occurs only by generating new output (key, value) pairs, which are then forwarded by the Hadoop system into the next phase of execution.
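To ground the (key, value) flow just described, here is a minimal word-count-style MapReduce job in Java that counts hits per page from clickstream log lines. It is a sketch only: the input layout (the requested page as the second whitespace-separated field) and the class names are assumptions, not the paper's actual job.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageHitCount {
        // Map: emit (page, 1) per log line; the input pair itself is never modified.
        public static class HitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text page = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\\s+");
                if (fields.length >= 2) {       // assumed layout: <ip> <page> ...
                    page.set(fields[1]);
                    context.write(page, ONE);   // communicate via new output pairs only
                }
            }
        }

        // Reduce: sum the hit counts collected for each page.
        public static class HitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "page hit count");
            job.setJarByClass(PageHitCount.class);
            job.setMapperClass(HitMapper.class);
            job.setCombinerClass(HitReducer.class); // summing is associative, so reuse the reducer
            job.setReducerClass(HitReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same jar runs unchanged in any of the cluster modes listed below; only the configuration differs.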

Hadoop Cluster Modes
Standalone (or Local) Mode: No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
Pseudo-Distributed Mode: Hadoop daemons run on the local machine.
Fully-Distributed Mode: Hadoop daemons run on a cluster of machines.

C. Results
The results are analyzed in two respects. A Hadoop cluster is a set of commodity machines networked together in one location; data storage and processing all occur within this cloud of machines. Different users can submit computing jobs to Hadoop from individual clients, which can be their own desktop machines at locations remote from the Hadoop cluster. Here we ran the Hadoop cluster in standalone mode, in which everything runs in a single JVM. This scale-out approach replaces scaling up huge monolithic servers and improves on the price/performance of earlier technology. Before looking at the results, it is important to understand a few points. Statistical integrity was not achieved in this work because of the cost of running the jobs. However, given the stability and consistency of the cloud infrastructure and the way the tests were performed, there is significant value in the findings.

IX. CONCLUSION

The primary goal of big data analytics is to help companies make better business decisions by enabling analysts and other users to examine huge volumes of transaction data, as well as other data sources such as clickstream data that may be left untapped by conventional business intelligence programs. Handling big data requires a lot of storage and good networks, and analytical techniques such as machine learning use considerable compute power. However, the cloud will help make big analytics possible for smaller businesses by letting service providers handle the analytics. Customers can use today's inexpensive storage to collect large amounts of data, which they can then send to providers via widely available high-bandwidth networks. Smaller organizations that want to run analytics software themselves can also do so on more powerful servers than in the past, thanks to technical advances such as lower-cost multicore processors. Even so, cloud-based services have become the more popular route because they're scalable and relatively inexpensive, and they don't require a company to have employees with data analytics skills.