A REVIEW ON EFFICIENT DATA ANALYSIS FRAMEWORK FOR INCREASING THROUGHPUT IN BIG DATA

1 V.N.Anushya, PG Scholar, Department of Computer Science and Engineering, Coimbatore Institute of Engineering and Technology, Coimbatore.
2 Dr.G.Ravi Kumar, Assistant Professor, Department of Computer Science and Engineering, Coimbatore Institute of Engineering and Technology, Coimbatore.

ABSTRACT

Background: The promise of data-driven decision making is now being recognized broadly, and there is growing enthusiasm for the notion of Big Data. Much data today is not natively in a structured format: tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display but not for semantic content and search. Transforming such content into a structured format for later analysis is a major challenge.

Objective: This paper applies a classification technique before mapping tasks onto resources, since MapReduce alone takes considerable time to decide which resource each task should be allocated to. Parallel database technology is used to increase the performance of Big Data because it allocates tasks to resources in parallel.

Result: In this model an Ensemble Classifier is used to classify the tasks. Support Vector Machine, Decision Tree and K-Nearest Neighbor are the classifiers combined to produce the Ensemble Classifier, so the data are processed with minimal scheduling time. Together with the Ensemble Classifier, the MapReduce model and parallel database technology increase the efficiency and throughput of Big Data by reducing the scheduling time.

Conclusion: This demonstrates the efficiency, effectiveness and scalability of the Big Data analysis methods.

Index Terms: MapReduce, Hadoop, Ensemble Classifier, Parallel Database.

INTRODUCTION

Big Data systems are capable of handling very large datasets at a time. In a broad range of application areas, data are being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be based on the data itself. Such Big Data analysis now drives nearly every aspect of modern society, including mobile services, retail, manufacturing, financial services, the life sciences and the physical sciences. These systems perform data storage, data analysis, data processing and data management in parallel, and can handle both structured and unstructured data at the same time. Big Data analytics is especially useful for hospital management and government sectors, in particular for monitoring climate conditions.

Big Data has three defining characteristics: volume, velocity and variety, explained in detail below. Many factors contribute to the increase in data volume. Volume refers to the scale of storage in Big Data; for example, Facebook processes about 2.5 petabytes of data per day. Overcoming the volume problem requires technologies that store huge amounts of data in a scalable manner and provide distributed approaches to querying and finding that data. Velocity describes the frequency at which data are created, captured and shared; the velocity of large data streams governs the capacity to parse the content, identify sentiment and recognize new patterns. Variety spans structured data, i.e., numeric data in traditional databases, and unstructured data such as text documents, email, video, audio, stock ticker data and financial transactions.

Hadoop is the most popular open-source framework used in Big Data to handle large datasets, and it implements the MapReduce framework (Wei Lu, Yanyan Shen, Su Chen, Beng Chin Ooi, 2012). Hadoop is a scalable, fault-tolerant distributed system for data storage and processing with two main components: the Hadoop Distributed File System (HDFS) and the MapReduce model, which is used for fault-tolerant distributed processing. Fig. 2 shows the architecture of Hadoop, which consists of one master node and many slave nodes; the master node hosts the MapReduce model used for computation.

Fig. 2 Architecture of Hadoop

HDFS comprises numerous DataNodes for storing data and a master node called the NameNode for monitoring the DataNodes and maintaining all the metadata. In HDFS, imported data are split into equal-size chunks, and the NameNode allocates the chunks to the different DataNodes (Wei Lu, Yanyan Shen, Su Chen, Beng Chin Ooi, 2012). The Hadoop MapReduce framework is designed to distribute storage and computation tasks across many servers, so that resources scale with demand while the system remains economical in size. The HDFS architecture thus consists of a single NameNode, numerous DataNodes and the HDFS client.

A MapReduce program typically consists of a pair of user-defined map and reduce functions. The map function is invoked for each record in the input dataset and produces a partitioned, sorted set of intermediate results (Chi Zhang, Feifei Li, Jeffrey Jestes, 2012). The core idea behind MapReduce is to map the dataset into a collection of key/value pairs and then reduce all pairs that share the same key (Dean J, Ghemawat S, 2004). The master node takes the input, divides it into smaller subproblems and distributes them to worker nodes. This works efficiently on very large amounts of high-dimensional data.
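To make the chunk-splitting idea above concrete, here is a minimal single-process sketch, not HDFS code: imported data are cut into equal-size blocks and a simulated NameNode assigns each block to a DataNode. The block size and the round-robin placement policy are illustrative assumptions (real HDFS also replicates every block).

```python
# Illustrative sketch (not HDFS code) of the idea described above: imported
# data are split into equal-size chunks and a simulated NameNode assigns
# each chunk to a DataNode. The round-robin placement policy is an
# assumption; real HDFS also replicates every block.

def split_into_chunks(data: bytes, block_size: int):
    """Split imported data into equal-size chunks (the last may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def allocate_chunks(num_chunks: int, datanodes: list):
    """Simulated NameNode: map each chunk index to a DataNode, round-robin."""
    return {i: datanodes[i % len(datanodes)] for i in range(num_chunks)}

chunks = split_into_chunks(b"some imported data set" * 10, block_size=64)
placement = allocate_chunks(len(chunks), ["datanode1", "datanode2", "datanode3"])
print(placement)  # {0: 'datanode1', 1: 'datanode2', 2: 'datanode3', 3: 'datanode1'}
```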

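The key/value idea behind MapReduce can be illustrated with the canonical word-count example. The sketch below is a minimal single-process rendering of the map, shuffle and reduce phases in Python, not Hadoop's distributed Java API; in real Hadoop each phase runs across many nodes.

```python
# Minimal single-process sketch of MapReduce: map records to key/value
# pairs, group pairs by key (the shuffle), then reduce each group.
from collections import defaultdict

def map_step(record: str):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield (word, 1)

def reduce_step(key: str, values):
    """Aggregate all values that share the same key."""
    return (key, sum(values))

def mapreduce(records):
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in map_step(record):
            groups[key].append(value)      # shuffle: group by key
    return [reduce_step(k, v) for k, v in groups.items()]  # reduce phase

print(mapreduce(["big data big throughput", "big data analysis"]))
# [('big', 3), ('data', 2), ('throughput', 1), ('analysis', 1)]
```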
The paper is organized as follows: previous related work is surveyed in Section 2, Section 3 gives a brief introduction to the proposed system, Section 4 is devoted to the system architecture, and Section 5 contains the conclusion of this study.

LITERATURE SURVEY

In the literature survey, various research papers were examined to identify the problems with Big Data and the solutions proposed for them. In data mining and data warehousing, 95% of the time is spent on gathering and retrieving the data and only 5% on analyzing it, and these systems cannot process large amounts of data in parallel; a multidimensional database system is used for storing and managing the data. Data mining is a single technology that applies many older computational techniques from statistics, machine learning and pattern recognition. To overcome these problems, Big Data platforms process large amounts of data using MapReduce technology combined with parallel database technology to perform parallel computation. The surveyed papers are discussed below.

Big Data integrates storage, analysis, management, processing and application in parallel with the help of MapReduce and parallel database technology, so that the efficiency of handling large amounts of data is improved; this is described in "Exploration on Big Data Oriented Data Analyzing and Processing Technology" (XIAO DAWEI, AO LEI, 2013). The MapReduce architecture also provides good scalability and fault-tolerance mechanisms (Chi Zhang, Feifei Li, Jeffrey Jestes, 2012). Cluster ensembles aim to combine multiple clusterings for prediction: for a given test set, each clusterer derives a label vector (A. Strehl et al., 2002). MapReduce is a simple parallel programming abstraction for distributed environments, and Hadoop was developed as the open-source counterpart of GFS and MapReduce (Arinto Murdopo, 2013). To perform computation on large amounts of data, a key-value storage system with data versioning and a partitioning algorithm provides reliability, greatly increasing scalability, reliability and durability at scale (Giuseppe DeCandia et al., 2007). The PROXIMUS framework reduces large datasets into smaller ones, applying both subsampling and compression before computationally expensive algorithms run; it improves performance and scalability by means of algebraic techniques and data structures (XIAO DAWEI, AO LEI, 2013). The nearest-neighbor technique, in which searching for an object in a high-dimensional data space is expensive and highly time-consuming, has a search time of O(dn log n) (S. Arya, D. Mount, N. Netanyahu, R. Silverman, A. Wu, 1994). A high-quality decision tree is produced by dividing the dataset into training, scoring and test sets (Zhiwei Fu, Fannie Mae, 2001); local greedy search is used throughout the dataset, so little search time is consumed.

PROPOSED SYSTEM

The classification technique is used to classify the whole dataset before mapping the tasks onto the resources, which reduces the time span; previously, every item of the dataset was analyzed individually and then mapped onto the resources, which takes more time to complete the task. To classify and analyze the data before mapping, an Ensemble Classifier is used along with the MapReduce model and parallel database technology to increase the efficiency and throughput of Big Data. Because each item of data used to be analyzed individually, deciding which resource a task should be allocated to took more time. In the MapReduce model, the map step maps the tasks onto the resources based on key/value pairs to perform the computations, and the reduce step aggregates all the results from the map step and produces a single output. Parallel database technology is used to perform the computation on large data in parallel, which also improves the performance of Big Data. The input in this project is a large dataset that is classified and analyzed in its entirety before the tasks are mapped onto the resources, reducing the time spent analyzing the data.
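As a rough illustration of this classify-before-mapping idea, the sketch below classifies every task up front and dispatches whole classes of tasks to resources in one pass, instead of choosing a resource per task during scheduling. Everything here is hypothetical: the classifier interface, class labels and resource names are placeholders, not the implementation proposed in this paper.

```python
# Hypothetical sketch of the classify-then-map pipeline: classify all tasks
# first, then allocate each class of tasks to a resource in a single pass.
# Classifier, labels and resource names are illustrative placeholders.
from collections import defaultdict

def dispatch(tasks, classifier, resource_for_class):
    """Group tasks by predicted class, then map each group to a resource."""
    labels = classifier.predict(tasks)          # classify before mapping
    groups = defaultdict(list)
    for task, label in zip(tasks, labels):
        groups[label].append(task)
    return {resource_for_class[label]: batch    # one allocation per class
            for label, batch in groups.items()}

class StubClassifier:                           # stands in for the ensemble
    def predict(self, tasks):
        return ["cpu" if t % 2 else "io" for t in tasks]

print(dispatch([1, 2, 3, 4], StubClassifier(), {"cpu": "node-a", "io": "node-b"}))
# {'node-a': [1, 3], 'node-b': [2, 4]}
```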
An Ensemble Classifier is a group of different classifiers that are made to process in parallel, with the knowledge of the fastest-processing classifier shared with the others. The data are therefore processed with minimal scheduling time. Along with the Ensemble Classifier, the MapReduce model and parallel database technology are used, which increases the efficiency and throughput of Big Data by reducing the scheduling time.

System Architecture:

An Ensemble Classifier is a group of different classifiers in which the classifiers are made to perform in parallel and the knowledge of the fastest-processing classifier is shared with the other classifiers.
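A minimal sketch of such an ensemble, assuming scikit-learn is available: the three classifiers used in this model (SVM, Decision Tree, K-NN) are combined by majority vote, and n_jobs lets them be fitted in parallel. The synthetic dataset is only a placeholder.

```python
# Sketch of the ensemble described above using scikit-learn (assumed
# available): SVM, Decision Tree and K-NN combined by majority vote.
from sklearn.datasets import make_classification   # placeholder dataset
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[("svm", SVC()),
                ("tree", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="hard",   # each constituent classifier casts one vote per sample
    n_jobs=-1,       # fit the three classifiers in parallel
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))   # majority-vote class labels
```

With voting="hard", each classifier casts one vote per sample, mirroring the majority-style combination described above.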

An ensemble classifier gives more accurate results than the individual classifiers on their own. Fig. 4 represents the proposed architecture: the dataset is loaded into an Ensemble Classifier built from three classifiers, namely Support Vector Machine, K-Nearest Neighbor and Decision Tree.

Fig. 4 System Architecture

SVMs are supervised learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Here, the SVM checks the incoming dataset and verifies whether the data are already known. If the incoming data are known, the SVM passes them on for processing; if the data are new, the SVM analyzes them first, after which they can be used for further processing.

The Decision Tree classifier is a simple and widely used classification technique. It applies a straightforward idea to solve the classification problem: it poses a series of carefully crafted questions about the attributes of the test record, and each time it receives an answer, a follow-up question is asked until a conclusion about the class label of the record is reached. The Decision Tree classifier inspects the incoming dataset and splits it by category (Mr. D. V. Patil, Prof. Dr. R. S. Bichkar, 2006). It splits on the attributes in the dataset; the attribute with the highest information gain is chosen as the splitting attribute (a short sketch of this criterion follows at the end of this section).

K-NN is a non-parametric method for classification and regression that predicts objects' values or class memberships based on the k closest training examples in the feature space. K-NN is a type of instance-based, or lazy, learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest-neighbor algorithm is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor. The incoming dataset is analyzed by K-NN: if a datum is similar to the data of its nearest neighbors, it is accepted and processed; otherwise it is analyzed before further processing.

Parallel computing includes two aspects: data-parallel processing and task-parallel processing. In the data-parallel mode, a large-scale task is decomposed into sub-tasks of the same scale and each sub-task is processed separately; compared to the whole task, each sub-task is easy to process. Adopting the task-parallel mode, by contrast, can make the decomposition of the tasks and the coordination of their relationships overly complicated.
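As promised above, here is a small sketch of the information-gain splitting criterion the Decision Tree classifier relies on: the entropy of the class labels is computed before and after splitting on an attribute, and the attribute with the largest reduction wins. The tiny example table is invented purely for illustration.

```python
# Sketch of the information-gain criterion mentioned above: the attribute
# whose split reduces class entropy the most is chosen as the splitting
# attribute. The tiny example table below is made up for illustration.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the parent minus the weighted entropy of the children."""
    parent = entropy(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attr], []).append(label)
    weighted = sum(len(ls) / len(labels) * entropy(ls)
                   for ls in children.values())
    return parent - weighted

rows = [{"size": "big", "fmt": "text"}, {"size": "big", "fmt": "video"},
        {"size": "small", "fmt": "text"}, {"size": "small", "fmt": "text"}]
labels = ["slow", "slow", "fast", "fast"]
# 'size' separates the classes perfectly, so it is chosen for the split:
print(information_gain(rows, labels, "size"))  # 1.0
print(information_gain(rows, labels, "fmt"))   # ~0.31
```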

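A minimal sketch of the data-parallel mode described above: one large task is decomposed into equal-scale sub-tasks that are processed concurrently. A local process pool stands in for a real cluster, and the workload (summing numbers) is a toy stand-in for an actual analysis task.

```python
# Minimal sketch of data-parallel processing: split one large task into
# equal-scale sub-tasks, process them concurrently, combine the partial
# results. A local process pool stands in for the cluster.
from multiprocessing import Pool

def process_subtask(chunk):
    """Each sub-task applies the same operation to its own data slice."""
    return sum(chunk)

def data_parallel(data, n_workers=4):
    size = len(data) // n_workers or 1
    subtasks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(n_workers) as pool:            # process sub-tasks in parallel
        partials = pool.map(process_subtask, subtasks)
    return sum(partials)                     # combine the partial results

if __name__ == "__main__":
    print(data_parallel(list(range(1_000_000))))  # 499999500000
```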
Using parallel database technology is one means of realizing the parallel processing of data. A parallel database supports standard SQL and provides data access services through SQL, which is widely used because it is simple and easy to apply. In Big Data analysis, however, the SQL interface faces great challenges: the advantage of SQL comes from encapsulating the underlying data access, but this encapsulation limits its openness to a certain extent. The user-defined functions provided by parallel databases are mostly designed for a single database instance and therefore cannot be executed across a parallel cluster, which means this traditional approach is not suitable for the processing and analysis of Big Data.

CONCLUSION

The MapReduce programming model was studied as a way to reduce the workload on the resources and to allocate tasks to them. Parallel database technology is used to perform the computation tasks in parallel, which increases the performance of Big Data. To reduce the scheduling time for allocating tasks to resources, a classification technique is applied before MapReduce and the parallel database technology. For classifying the tasks an Ensemble Classifier is used, that is, a group of different classifiers such as the Support Vector Machine (SVM), Decision Tree and K-Nearest Neighbor (KNN) classifiers. These classifiers were studied so that the knowledge of the fastest-processing classifier can be shared with the others, which greatly reduces the scheduling time. Along with the Ensemble Classifier, the MapReduce programming model and parallel database technology are used to increase the efficiency and throughput of Big Data.

REFERENCES

Arinto Murdopo (2013), Distributed Decision Tree Learning for Mining Big Data Streams, July.

S. Arya, D. Mount, N. Netanyahu, R. Silverman, A. Wu (1994), An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions, Proc. Fifth Symp. Discrete Algorithms (SODA), pp. 573-582.

Chi Zhang, Feifei Li, Jeffrey Jestes (2012), Efficient Parallel kNN Joins for Large Data in MapReduce.

Dean J, Ghemawat S (2004), MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI '04), San Francisco, California, USA, pp. 137-150.

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels (2007), Dynamo: Amazon's Highly Available Key-value Store, SOSP '07, October 14-17.

Mr. D. V. Patil, Prof. Dr. R. S. Bichkar (2006), A Hybrid Evolutionary Approach to Construct Optimal Decision Trees with Large Data Sets, IEEE.

A. Strehl et al. (2002), Cluster Ensembles: A Knowledge Reuse Framework for Combining Partitionings, JMLR, Vol. 3.

Wei Lu, Yanyan Shen, Su Chen, Beng Chin Ooi (2012), Efficient Processing of k Nearest Neighbor Joins using MapReduce, Proceedings of the VLDB Endowment, Volume 5, Issue 10, June.

XIAO DAWEI, AO LEI (2013), Exploration on Big Data Oriented Data Analyzing and Processing Technology, Vol. 10.

Zhiwei Fu, Fannie Mae (2001), A Computational Study of Using Genetic Algorithms to Develop Intelligent Decision Trees, Proceedings of the 2001 IEEE Congress on Evolutionary Computation.