REVIEW PAPER ON BIG DATA USING HADOOP




International Journal of Computer Engineering & Technology (IJCET) Volume 6, Issue 12, Dec 2015, pp. 65-71, Article ID: IJCET_06_12_008
Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=6&itype=12
ISSN Print: 0976-6367 and ISSN Online: 0976-6375
IAEME Publication

Ms. Gurpreet Kaur
Student, PG Department of Computer Science, Mata Gujri College, Fatehgarh Sahib, Punjab

Ms. Manpreet Kaur
Assistant Professor, PG Department of Computer Science, Mata Gujri College, Fatehgarh Sahib, Punjab

ABSTRACT
Big data refers to large and complex data sets marked by vast volumes of data, social-media analytics, demands on data-management efficiency, and real-time data. Big data analytics is the process of examining huge amounts of data. Big data is characterized by the dimensions volume, variety, velocity, and veracity. For big data processing there is a framework called Hadoop, which uses the MapReduce paradigm to process the data.

Key words: Big Data, Parameters, Challenges, Opportunities, Hadoop

Cite this Article: Ms. Gurpreet Kaur and Ms. Manpreet Kaur. Review Paper on Big Data Using Hadoop. International Journal of Computer Engineering and Technology, 6(12), 2015, pp. 65-71. http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=6&itype=12

1. INTRODUCTION
Big data means, quite literally, big data: a collection of huge datasets that cannot be handled using traditional computing techniques. Big data is not only the data itself; it also encompasses various tools, techniques, and frameworks. Data that has extra-large volume, comes from a variety of sources in a variety of formats, and arrives at great velocity is normally referred to as big data. Big data can be structured, unstructured, or semi-structured. Big data includes data generated by various devices and applications, such as black-box data, a component of helicopters and other aircraft, which captures the voices of the flight crew and recordings from earphones and microphones.
Social media data: Twitter, for example, holds the information and views posted by millions of people. Stock exchange data holds information about the buy and sell decisions that customers make on shares of different companies. Search engine data is retrieved in large volumes from various databases. http://www.iaeme.com/ijcet/index.asp 65 editor@iaeme.com

2. BIG DATA PARAMETERS
Because the data comes from different sources in different forms, it is characterized by the four Vs.

Figure 1 Four Vs of Big Data

Volume: Volume refers to the sheer amount of data, the huge quantity generated every second. Machine-generated data is an example of this component. Nowadays data volume is growing from gigabytes to petabytes; an estimated 40 zettabytes of data will exist by 2020, 300 times the volume that existed in 2005.

Velocity: Velocity is the speed at which data is generated and processed, for example, social media posts.

Variety: Variety is another important characteristic of big data. It refers to the type of data, which may come in different forms such as text, numbers, images, audio, and video. On Twitter, 400 million tweets are sent per day and there are 200 million active users.

Veracity: Veracity refers to the uncertainty or accuracy of data. Data is uncertain due to inconsistency and incompleteness.

3. CHALLENGES WITH BIG DATA
3.1. Heterogeneity and Incompleteness
To evaluate data, it should be structured, but when we deal with big data it may be structured or unstructured. Heterogeneity is a major challenge in data analysis, and analysts need to cope with it. Consider the example of a patient in a hospital: a record is generated for each medical test and another record for the hospital stay, and these records differ for every patient. Such a design is not well structured, so coping with heterogeneous and incomplete data is required, and good data-analysis methods must be applied to it.

3.2. Privacy
Privacy of data is another major problem with big data. Some countries have strict laws about data privacy; the USA, for example, has strict laws for health records, while elsewhere enforcement is less strong. On social media, for instance, we cannot access the private posts of users for sentiment analysis.

3.3. Scale
As the name suggests, big data involves huge data sets, and managing large data sets has been a problem for decades. In previous years this problem was eased by processors getting faster, but now data quantities are growing while processor speeds are static. The world is moving towards cloud technology, and with this shift data is generated at a very high rate; this rate of growth is a challenging problem for data analysts. The hard disks used to store the data have slow I/O performance, but hard disks are now being replaced by solid-state drives and other technologies that are not as slow, so new storage systems should be designed around them.

3.4. Human Collaboration
In spite of advanced computational models, there are many patterns that a computer cannot recognize. A new method of harnessing human ingenuity to solve problems is crowd-sourcing; Wikipedia is the perfect example. We rely on data provided by strangers, and most of the time they are correct, but there can be people with other motives, such as providing false data, and we need technological models to handle this. As humans, we can look at the reviews of a book, find that a few are positive and a few are negative, and come to a decision on whether or not to buy it.

4. OPPORTUNITIES OF BIG DATA
4.1. Media
Media companies use big data to promote and sell products by targeting the interests of users on the internet. For example, from social media posts, data analysts measure the number of posts and then evaluate user interest.
This can also be done by analyzing the positive and negative reviews posted on social media.

4.2. Government
Big data can be used to handle issues faced by governments. The Obama administration announced the Big Data Research and Development Initiative in 2012. Big data analysis played an important role in the BJP winning the elections in 2014, and the Indian government is applying big data analysis to the Indian electorate.

4.3. Technology
Almost every top organization, such as Facebook and Yahoo, has adopted big data and is investing in it. Facebook holds 50 billion photos of users, and every month Google handles 100 billion searches. From these statistics we can see that there are many opportunities on the internet and social media.

4.4. Science and Research
Big data is a current topic of research. A large number of researchers are working on big data, and many papers are being published about it.

4.5. Healthcare
According to IBM's Big Data for Healthcare, 80% of medical data is unstructured. Healthcare organizations are adopting big data technology to capture complete data about a patient. To improve healthcare and lower its cost, big data analysis is needed and the appropriate technology should be adopted.

5. TECHNIQUES AND TECHNOLOGIES
Processing large amounts of data requires exceptional technologies, and various techniques and technologies have been introduced for manipulating, analyzing, and visualizing big data. There are many solutions for handling big data, but Hadoop is one of the most widely used.

5.1. Hadoop
Hadoop is an open-source project hosted by the Apache Software Foundation. It consists of many small subprojects that belong to the category of infrastructure for distributed computing. Hadoop mainly consists of:
1. File System (the Hadoop Distributed File System)
2. Programming Paradigm (MapReduce)
The other subprojects provide complementary services or build on the core to add higher-level abstractions. Several problems arise in storing huge amounts of data. Though the storage capacities of drives have expanded massively, the rate of reading data from them has not shown comparable progress: reading takes a long time, and writing is slower still. The time can be decreased by reading from multiple disks at once. Using only one hundredth of each disk may seem wasteful, but if there are one hundred datasets, each of one terabyte, providing shared access to them is also a solution. Using several pieces of hardware introduces problems of its own, however, as it increases the chance of failure.
This can be mitigated by replication, i.e., creating redundant copies of the same data on different devices, so that in case of failure a copy of the data is still available. The main remaining problem is combining the data read from the different machines; many methods are available in distributed computing to handle this, but it is still quite challenging. All the complications discussed are handled by Hadoop: the problem of failure is addressed by the Hadoop Distributed File System, and the problem of combining data is addressed by the MapReduce programming paradigm. MapReduce reduces the complexity of disk reads and writes by providing a programming model that expresses computation over keys and values. The main components of Hadoop are as follows.

5.1.1. Hadoop Distributed File System
Hadoop ships with a distributed file system called HDFS, which stands for Hadoop Distributed File System. HDFS is a versatile, clustered way of handling files in a big data environment. HDFS is not the final destination for files; it is a kind of data service that offers a different set of capabilities

Review Paper on Big Data Using Hadoop required when data volumes and velocity are high. Because the data is written once and then read many times. HDFS is a good choice for supporting big data analysis. Figure 2 HDFS Architecture HDFS works by cracking large files into small parts called blocks. The blocks are stored on data nodes, and it is the responsibility of the NameNode to notice what blocks on which data nodes make up the complete file. The Name Node also perform as a traffic cop, handling all access to the files. The entire collection of all the files in the cluster is sometimes referred to as the file system namespace. Even though a strong relationship occurs between the Name Node and the data nodes, they run in a loosely coupled fashion. This allows the cluster elements to act dynamically. The data nodes communicate among themselves so that they can cooperate during normal file system operations. This is mandatory because blocks for one file are likely to be stored on multiple data nodes. 5.1.2. Map Reduce Hadoop Map Reduce is an implementation of the algorithm developed and managed by the Apache Hadoop project. It is helpful to think about this application as a Map Reduce engine, because that is absolutely how it works. You provide information (fuel); the engine converts the information into production rapidly. Hadoop Map Reduce consists of many stages, each with a meaningful set of operations helping to get the answers you need from big data. The development starts with a user request to run a Map Reduce program and go on until the results are written back to the HDFS. http://www.iaeme.com/ijcet/index.asp 69 editor@iaeme.com

Figure 3 Map Reduce Architecture

5.1.3. Get the Big Data Ready
When a client requests that a MapReduce program run, the first step is to locate and read the input file. The file format is completely arbitrary, but the data must be converted into something the program can process. This is the job of InputFormat and RecordReader. InputFormat decides how the file is going to be broken into smaller pieces for processing, using a function called InputSplit. It then assigns a RecordReader to transform the raw data for processing by the map. Several types of RecordReader are supplied with Hadoop, offering a wide variety of conversion options.

5.1.4. Let the Big Data Map Begin
Your data is now in a form acceptable to map. For each input pair, a specific instance of map is called to process the data. Because map and reduce need to work together to process your data, the program must gather the output from the separate mappers and pass it to the reducers. This task is performed by an OutputCollector, and a Reporter function provides information gathered from the map tasks. This entire task is performed on multiple nodes in the Hadoop cluster simultaneously. After all the map tasks are complete, the intermediate results are gathered in the partition, and a shuffle occurs, sorting the output for optimal processing by reduce.

5.1.5. Reduce and Combine for Big Data
For each output pair, reduce is called to do its work. In a pattern similar to map, reduce gathers its output while all the tasks are processing, but reduce cannot begin until all the mapping is done. The output of reduce is also a key and a value. Hadoop provides an OutputFormat feature that works much like InputFormat: OutputFormat takes the key-value pairs and organizes the output for writing to HDFS. The last step is to actually write the data to HDFS.
This is done by RecordWriter, which performs similarly to RecordReader except in reverse: it takes the OutputFormat data and writes it to HDFS.
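The full pipeline just described, map, then shuffle/sort, then reduce, can be sketched in miniature with the classic word-count example. This is a hedged, single-process imitation of what Hadoop performs across a cluster; the function names `map_fn`, `shuffle`, and `reduce_fn` are illustrative, not Hadoop API names.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line: str):
    # The RecordReader hands each mapper one record (here, a line of text);
    # the mapper emits intermediate (key, value) pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Between the phases, the framework groups all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Each reducer combines the grouped values for one key into a final pair.
    return key, sum(values)

lines = ["big data with hadoop", "hadoop map reduce"]
mapped = chain.from_iterable(map_fn(line) for line in lines)
counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
assert counts == {"big": 1, "data": 1, "with": 1, "hadoop": 2, "map": 1, "reduce": 1}
```

In a real cluster the mappers and reducers run on many nodes at once and the shuffle moves data over the network, but the data flow is the same as in this sketch.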

6. CONCLUSION
There are various challenges and issues around big data. Fundamental research into these technical issues must be supported and encouraged if we want to achieve the benefits of big data. Big data methods make tractable operational, financial, and commercial problems, such as those in aviation, that were previously unsolvable within economic and human-capital constraints using discrete data sets. By centralizing data acquisition and consolidation in the cloud, and by using cloud-based virtualization infrastructure to mine data sets efficiently, big data methods provide new insight into existing data sets.