International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop




Big Data and Hadoop

Simmi Bagga 1, Satinder Kaur 2
1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt., INDIA. E-mail: simmibagga12@gmail.com
2 Assistant Professor, GNDU, RC, Sathiala, INDIA

Abstract: In this era, data are continuously acquired for a variety of purposes. Data are generated by large-scale simulations, astronomical observatories, high-throughput experiments, and high-resolution sensors. All of these data can lead to new discoveries if scientists have adequate tools to extract knowledge from them. Hadoop is a software platform that processes vast amounts of data and helps to manage Big Data. Hadoop can handle data of all types from disparate systems: structured, unstructured, log files, pictures, audio files, communication records, email, etc. In this paper, we discuss the concept of Big Data, its sources, and its main and additional characteristics; furthermore, we describe Hadoop, a platform that provides a solution to the Big Data problem.

Keywords: Big Data; Hadoop; Large-scale Simulations; Astronomical Observatories

I. INTRODUCTION

Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it. Big Data is a voluminous amount of structured, semi-structured, and unstructured data that has the potential to be mined for information. Although the term does not refer to any specific quantity, it is often used when speaking about petabytes or exabytes of data. Today, Big Data is commonly described by three Vs: the extreme Volume of data, the wide Variety of data types, and the Velocity at which the data arrives. Because Big Data takes too much time and costs too much money to load into a traditional relational database for analysis, new approaches to storing and analyzing data have emerged which rely less on data schema and data quality.
The goal of big data management is to ensure a high level of data quality and accessibility for business intelligence and big data analytics applications. Corporations, government agencies, and other organizations employ big data management strategies to help them contend with fast-growing pools of data, typically involving many terabytes or even petabytes of information saved in a variety of file formats. Effective big data management helps companies locate valuable information in large sets of unstructured and semi-structured data from a variety of sources, including call detail records, system logs, and social media sites.

There are huge volumes of data in the world. Big Data is data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast, or does not fit the structures of existing database architectures. From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data. In 2011, the same amount was created every two days, and by 2013 it was created every 10 minutes. 90% of the data in the world today has been created in the last two years alone.

Two major architectural changes are needed in traditional integration platforms in order to fulfill future requirements. First, organizations need the ability to store www.ijaera.org 2015, IJAERA - All Rights Reserved 233

and use big data: most big data should remain available for a long time, and tools and techniques should be available to process it for business benefit. Second, there is a need for predictive analytical models based on the history or patterns of past data, or on hypothetical data-driven models. While business intelligence deals with what has happened, business analytics deals with what is expected to happen, i.e., forecasting the situation.

II. SOURCES OF BIG DATA

Big data comes from many different sources: sensors that gather climate information, social media sites where users post digital pictures and videos, transaction records created while purchasing goods, cell phone GPS signals, and so on, as shown in Fig. 1. Social media such as Facebook produces large amounts of data whenever users look at another person's timeline, send or receive a message, or post photos or videos. Facebook also receives data from or about the computer, mobile phone, or other devices its users employ to install Facebook apps or to access the site; whenever a user visits a game, application, or website that uses the Facebook Platform; and from advertising partners, customers, and other third parties that help deliver ads.

Fig. 1. Sources of Big Data

Physical sensors likewise produce data of high velocity, volume, and variety. Sensors are used for geolocation, temperature, noise, attention, engagement, and biometric purposes, collecting data in a variety of ways. Using this data to model user context and predict behavior requires processing a huge volume of data, which is a big challenge.
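The disparate record formats described above eventually have to be funneled into a common shape before analysis. A minimal sketch, with entirely hypothetical field names, of normalizing a social-media post, a sensor reading, and a GPS ping into one common event form:

```python
def normalize(record):
    """Map a raw source record onto a common (source, timestamp, payload) event.
    Field names here are illustrative, not from any real API."""
    if "post_text" in record:                    # social media post
        return ("social", record["ts"], record["post_text"])
    if "celsius" in record:                      # climate sensor reading
        return ("sensor", record["ts"], record["celsius"])
    if "lat" in record and "lon" in record:      # cell phone GPS signal
        return ("gps", record["ts"], (record["lat"], record["lon"]))
    return ("unknown", record.get("ts"), record)

raw = [
    {"ts": 1, "post_text": "hello"},
    {"ts": 2, "celsius": 21.5},
    {"ts": 3, "lat": 31.3, "lon": 75.6},
]
events = [normalize(r) for r in raw]
print([e[0] for e in events])   # ['social', 'sensor', 'gps']
```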

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable fashion.

III. CHARACTERISTICS OF BIG DATA

"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (Gartner, 2012). Big data can be described by the following characteristics:

3(a). Main Characteristics, i.e., the 3 Vs

I. Volume

Big data management software enables organizations to store, manage, and manipulate vast amounts of data at the right speed and at the right time. It is the size of the data that determines its value and potential, and whether it can actually be considered Big Data at all. The quantity of data generated by machines, networks, and human interaction on systems such as social media is massive; the name "Big Data" itself refers to this characteristic size.

II. Variety

Variety describes the different formats of data. These include documents, emails, social media text messages, video, still images, audio, graphs, and the output of all types of machine-generated data from sensors, cell phone GPS signals, DNA analysis devices, and more. Much of this data is unstructured or semi-structured, which creates problems for storing, mining, and analyzing it. Unstructured data is growing much more rapidly than structured data: it is estimated that unstructured data doubles every three months, with, for example, seven million web pages added each day. There are two primary challenges regarding variety of data.
First, storing and retrieving these data types quickly and cost-efficiently. Second, during analysis, blending or aligning data types from different sources so that all the data describing a single event can be extracted and analyzed together.

III. Velocity

Today, data is generated fast and also needs to be processed fast. Big Data velocity deals with the pace at which data flows in from sources such as business processes, machines, networks, and human interaction with social media sites, mobile devices, etc. The flow of data is massive and continuous, and such real-time data can help researchers only if they are able to handle its velocity. Velocity applies to data in motion: the growth of information streams and sensor network deployments has led to a constant flow of data at a pace that traditional systems cannot handle. Initially, data was analyzed in batch processes, but with new sources of data such as social and mobile applications, the batch model breaks down. Data now streams into the server continuously, in real time, and the result is only useful if the delay is very short.
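The batch-versus-streaming contrast above can be sketched in a few lines of plain Python (no Hadoop or streaming framework involved): a batch job produces one answer only after the whole data set has been read, while a streaming job yields an up-to-date result after every arriving record.

```python
def batch_count(records):
    """Batch style: wait until the entire data set is available, then answer."""
    total = 0
    for _ in records:
        total += 1
    return total            # single result, available only at the end

def streaming_count(stream):
    """Streaming style: emit a running count as each record arrives."""
    total = 0
    for _ in stream:
        total += 1
        yield total         # a fresh result after every record, with minimal delay

events = ["click", "click", "purchase", "click"]
print(batch_count(events))                     # 4
print(list(streaming_count(iter(events))))     # [1, 2, 3, 4]
```

The design point is latency, not the arithmetic: the streaming version's intermediate values are usable while data is still in motion.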

Figure 2: Exploring the 3 Vs

3(b). Additional Characteristics

I. Veracity

Big Data veracity refers to biases, noise, and abnormality in data: is the data being stored and mined actually meaningful to the problem being analyzed? Compared to volume and velocity, veracity is the biggest challenge in data analysis. When scoping a big data strategy, one needs one's team and partners to work together to clean the data and keep dirty data from accumulating in the systems.

II. Validity

Like veracity, validity concerns the accuracy of data, so that it can be properly used for decision making and for drawing future conclusions.

III. Volatility

Big data volatility refers to how long data remains valid and how long it should be stored. In a world of real-time data, one needs to determine the point at which data is no longer relevant to the current analysis. Big data management thus clearly deals with concerns beyond volume, variety, and velocity, including veracity, validity, and volatility.
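The three additional characteristics can be illustrated with a minimal data-cleaning filter. This is a sketch under assumed rules, not a real pipeline: the value range and the retention window are invented policies standing in for veracity, validity, and volatility checks.

```python
MAX_AGE = 3600   # assumed volatility policy: records older than an hour are stale

def is_clean(record, now):
    """Keep a record only if it passes veracity, validity, and volatility checks."""
    if record.get("value") is None:          # veracity: missing / noisy reading
        return False
    if not (0 <= record["value"] <= 100):    # validity: outside the assumed range
        return False
    if now - record["ts"] > MAX_AGE:         # volatility: no longer relevant
        return False
    return True

now = 10_000
records = [
    {"ts": 9_900, "value": 42},     # clean
    {"ts": 9_900, "value": None},   # fails veracity
    {"ts": 9_900, "value": 250},    # fails validity
    {"ts": 1_000, "value": 42},     # fails volatility
]
clean = [r for r in records if is_clean(r, now)]
print(len(clean))   # 1
```

Keeping such checks at ingestion time is one way to avoid dirty data accumulating in the system, as the Veracity discussion above recommends.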

IV. HADOOP: A BIG DATA MANAGEMENT PLATFORM

Hadoop is a processing engine designed to handle extremely high volumes of data in any structure; it provides distributed storage and distributed processing for very large data sets. Hadoop has two main components:

The Hadoop Distributed File System (HDFS), which supports data in structured relational form, in unstructured form, and in any form in between. HDFS is a reliable distributed file system that provides high-throughput access to data.

The MapReduce programming paradigm, which manages applications on multiple distributed servers. It is a framework for high-performance distributed data processing, based on a divide-and-aggregate paradigm.

HDFS gives the programmer virtually unlimited storage. Its advantages are:

Horizontal scalability: thousands of servers holding petabytes of data. When even more storage is needed, one does not switch to more expensive solutions but simply adds servers.

Commodity hardware: HDFS is designed with relatively cheap commodity hardware in mind; it is self-healing and replicating.

Fault tolerance: Hadoop also deals with hardware failures. With ten thousand servers, one server will fail every day on average. HDFS foresees this by replicating the data, by default three times, on different data node servers; if one data node fails, the other two copies can be used to restore the third in a different place.

MapReduce takes care of the distributed computing. It reads the data, usually from its storage, HDFS, in an optimal way, though it can also read data from other places, including mounted local file systems, the web, and databases. It divides the computations between different computers (servers, or nodes), and it is also fault-tolerant.
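The divide-and-aggregate paradigm described above can be illustrated with the classic word-count example, simulated here in pure Python (no Hadoop cluster required). The map phase emits (key, value) pairs per input split, the framework's shuffle groups pairs by key, and the reduce phase aggregates each group; this sketch imitates those three stages in a single process.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: aggregate all the counts shuffled to this key."""
    return (word, sum(counts))

lines = ["big data needs hadoop", "hadoop stores big data"]

# Shuffle/sort: group intermediate pairs by key, as the framework would do
# across the network between map and reduce tasks.
groups = defaultdict(list)
for word, one in chain.from_iterable(map_phase(line) for line in lines):
    groups[word].append(one)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)   # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'stores': 1}
```

In a real cluster, the map and reduce functions run in parallel on many nodes and the shuffle moves data between them; the logic of the two user-supplied functions is the same.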
If some nodes fail, Hadoop knows how to continue the computation, re-assigning the incomplete work to another node and cleaning up after the node that could not complete its task. It also knows how to combine the results of the computation in one place.

A set of machines running HDFS and MapReduce is known as a Hadoop cluster, and the individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand; the more nodes in the cluster, the better the performance. Together, these two components let Hadoop provide an effective solution to the problem of Big Data, meeting its major requirements of reliability, scalability, and so on.

V. CONCLUSION

In this paper, we explained Big Data, its sources, and its characteristics, and showed how Hadoop can be a solution to the Big Data problem. Due to the scale, diversity, and complexity of Big Data, new architectures, techniques, algorithms, and analytics are required to manage it and extract value and hidden knowledge from it. There are various

architectural changes that are needed in traditional platforms in order to overcome future challenges. Hadoop, a processing engine designed to handle extremely high volumes of data in any structure, has become a valuable business intelligence tool.

VI. REFERENCES

[1] Beyer, M. A., Laney, D.: The Importance of 'Big Data': A Definition. Gartner (2012).
[2] Kyuseok Shim: MapReduce Algorithms for Big Data Analysis. DNIS 2013.
[3] Vinayak Borkar, Michael J. Carey, Chen Li: Inside Big Data Management: Ogres, Onions, or Parfaits? EDBT/ICDT 2012 Joint Conference, Berlin, Germany, ACM 2012.
[4] Hadoop, Powered by Hadoop, http://wiki.apache.org/hadoop/poweredby.