Research Article www.ijptonline.com EFFICIENT TECHNIQUES TO DEAL WITH BIG DATA CLASSIFICATION PROBLEMS G.Somasekhar 1 *, Dr. K.


ISSN: 0975-766X
CODEN: IJPTFI
Available Online through www.ijptonline.com
Research Article

EFFICIENT TECHNIQUES TO DEAL WITH BIG DATA CLASSIFICATION PROBLEMS
G. Somasekhar 1*, Dr. K. Karthikeyan 2
1 Research Scholar/SCSE, VIT University, Vellore, Tamil Nadu, India.
2 Associate Professor/SAS, VIT University, Vellore, Tamil Nadu, India.
Email: gidd.somasekhar2014@vit.ac.in, k.karthikeyan@vit.ac.in
Received on 05-08-2015 Accepted on 25-08-2015

Abstract
Big data analytics is the process of examining large data sets containing a variety of data types, i.e., big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. It is an extension of data mining. Many traditional data mining techniques can be made suitable for big data either through small modifications or by combining more than one technique. MapReduce is a technique that can be used in many big data applications. The main focus of this paper is big data classification. The techniques that can be used for big data classification are discussed. These classification techniques can be implemented in all big data scenarios.

Keywords: Classification, big data, analytics, mapreduce, ensemble learning, fuzzy set approach, incremental algorithm, semi-supervised learning.

Introduction
Big data management is becoming crucial nowadays because of the growth of social networking websites, powerful mobile devices, sensors, and cloud computing. According to IDC analysis, the global data volume is expected to grow 44 times between 2009 and 2020. It may even grow beyond what we can control. The existing technology and infrastructure may not be able to maintain such large volumes of data. Big data has become a buzzword, leading to revolutionary changes in data processing, data storage and data analytics.
As information is a basic need for any kind of development, society needs more big data techniques to extract useful information from big data. Big data tools such as Hadoop have evolved into primary tools for any growing organization that has to deal with big data.

IJPT Sep-2015 Vol. 7 Issue No.2 8942-8950

Characteristics of Big Data and application of big data analytics
Big data has 4 V's, which are regarded as its fundamental characteristics:
i) Volume: Real-time big data scenarios demand huge volumes of data, ranging from petabytes to exabytes and possibly beyond.
ii) Variety: Collection of different types of data from different data sources.
iii) Velocity: Speed of data generation and data update.
iv) Value: Valuable knowledge extraction and decision making.

Some of the challenges of big data include limited main memory, data security, data recovery, data processing, and maintaining the balance between ethical values and big data management. Big data analytics evolved as the discipline of extracting interesting patterns from big data to support the decision-making process. It is applied in many fields such as finance, medicine, bioinformatics, science, space and the retail industry. Some applications and big data algorithms are mentioned in Table-1.

Table-1: Applications of big data, the algorithms and computing methods.

Big Data Classification: Problems
Classification is a data mining technique that extracts categorical labels or classes from a large data set. When this data set has the characteristics of big data, the task is termed big data classification. As big data poses several challenges (mentioned in section 2), traditional classification techniques may not be suitable for the big data classification problem. Conventional classification algorithms have to be modified and combined with a big data technique to meet big data processing needs. In addition, big data classification algorithms need to be scalable and incremental. They have to solve the problems of big stream data such as concept drift, infinite length, concept evolution, and feature evolution. The following section briefly explains efficient big data classification techniques.

G.Somasekhar* et al. International Journal Of Pharmacy & Technology

Techniques for Big Data Classification
a) Application of MapReduce to traditional data classification algorithms [6]: Both lazy learners and eager learners can be subjected to MapReduce to obtain accurate classification results in less time. Lazy learners include the K-Nearest Neighbor classifier and Case Based Reasoning (CBR). Eager learners include classification algorithms such as Bayesian classification, decision tree induction, rule based classification, classification by back propagation and so on. The MapReduce strategy and its application to data classification algorithms are depicted in fig 1 and fig 2 below.

Fig 1: Sample application of Mapreduce for shape counter.
Fig 2: Application of Mapreduce on a traditional data classification technique.

b) Semi-supervised learning: Semi-supervised learning [5] is a mixture of data classification (supervised learning) and data clustering (unsupervised learning), as depicted in fig 3 below. Building classifiers becomes very difficult, labour intensive, costly and time consuming in real-time big data scenarios. It is often the case that only a small number of labeled samples are available to train a few classifiers, while a large number of unlabeled samples are available to build clusters from big data. In such cases this classification technique can be chosen.
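As an illustration of the MapReduce strategy in (a), the sketch below applies it to a K-Nearest Neighbor classifier in plain Python. It is not the method of [6]: the function names (map_partition, reduce_neighbours, classify), the value of K and the toy data are all assumptions made for this example, and the partitions are processed sequentially where a real cluster would process them in parallel.

```python
from collections import Counter
from functools import reduce
import heapq

K = 3  # number of neighbours; an illustrative choice, not taken from the paper


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_partition(partition, query):
    """Map step: the K nearest (distance, label) pairs within one data partition."""
    scored = [(squared_distance(features, query), label) for features, label in partition]
    return heapq.nsmallest(K, scored)


def reduce_neighbours(left, right):
    """Reduce step: merge two candidate lists and keep the K nearest overall."""
    return heapq.nsmallest(K, left + right)


def classify(partitions, query):
    # On a cluster each partition would be mapped on a different node.
    local_candidates = [map_partition(p, query) for p in partitions]
    nearest = reduce(reduce_neighbours, local_candidates)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled tuples, split into two partitions.
partitions = [
    [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")],
    [((0.9, 1.1), "A"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")],
]
print(classify(partitions, (1.0, 0.9)))
```

Because every partition returns its own K best candidates, merging them in the reduce step yields the same K neighbours as a scan over the whole data set, which is what makes this lazy learner straightforward to distribute.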

Fig 3: Generation of mixed ensemble by semi-supervised learning.

c) Fuzzy set approach: In fig 4, the membership values of x in the fuzzy sets do not have to total 1. Each x may be a member of two or more fuzzy sets. (Here x is the value of income.)

Fig 4: Graph representation of fuzzy membership values for fuzzy sets low_income, medium_income and high_income in a sample employee data set.

For example, m_medium_income($49k) = 0.15 and m_high_income($49k) = 0.96, so m_medium_income($49k) + m_high_income($49k) ≠ 1 (where m( ) is the membership function). This approach is called the fuzzy set approach [7] and is based on fuzzy set theory. Fuzzy set theory is also known as possibility theory, which is very useful in dealing with vague or inexact facts in big data applications. Fuzzy rules and fuzzy models can be derived from fuzzy sets. Fuzzy logic systems can be used in numerous areas for big data classification, including market research, finance, health care, and environmental engineering.
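The idea that one income value can belong to several fuzzy sets at once can be sketched with piecewise-linear membership functions. The breakpoints below (20, 40, 45, 50 and 60, in thousands of dollars) are hypothetical and are not read off fig 4, so the resulting values differ from the 0.15 and 0.96 quoted above; only the set names low_income, medium_income and high_income come from the text.

```python
def falling(x, start, end):
    """Membership that is 1 up to `start` and decreases linearly to 0 at `end`."""
    if x <= start:
        return 1.0
    if x >= end:
        return 0.0
    return (end - x) / (end - start)


def rising(x, start, end):
    """Membership that is 0 up to `start` and increases linearly to 1 at `end`."""
    return 1.0 - falling(x, start, end)


def memberships(income_k):
    """Degree of membership of an income (in $k) in each fuzzy set (assumed breakpoints)."""
    return {
        "low_income": falling(income_k, 20, 40),
        "medium_income": min(rising(income_k, 20, 40), falling(income_k, 45, 60)),
        "high_income": rising(income_k, 40, 50),
    }


degrees = memberships(49)
# The tuple belongs to medium_income and high_income at the same time,
# and the degrees are not required to add up to 1.
best_label = max(degrees, key=degrees.get)
print(degrees, best_label)
```

Picking the label with the largest degree is only one simple way to turn fuzzy memberships into a crisp class; fuzzy rule systems usually combine several attributes before deciding.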

d) Incremental learning:
Properties of incremental classification [8]:
i) Updates a classifier dynamically using the test data.
ii) No need to store all training data in main memory.
iii) Flexibility to modify a model based on newly trained records.
iv) The classifier can adapt to the gradual concept drift problem of big stream data.

Example: Very Fast Decision Tree (VFDT) and Concept-adapting Very Fast Decision Tree (CVFDT) algorithms. The traditional Hoeffding Tree algorithm is modified in three ways to obtain the VFDT algorithm:
i) Aggressive breaking of near ties during attribute selection.
ii) Deactivating the least promising leaves.
iii) Dropping poor splitting attributes.

To handle the concept drift problem efficiently, VFDT is in turn extended to CVFDT. Both VFDT and CVFDT can deal with big stream data classification problems while improving speed, memory utilization and scalability. Other examples include incremental decision trees, incremental Bayesian classification and so on.

e) Ensemble learning: Incremental learning cannot remove old records from a classifier. This limitation led to a new learning method called ensemble learning [8]. It divides a large data stream into small data chunks. For each chunk, an independent classifier is built. Finally, the n top-ranked classifiers are selected by heuristic methods, where n is the size of the ensemble, and majority voting is applied to obtain the label for a test tuple, as depicted in fig 5.

Fig 5: Ensemble learning.
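The chunk-based ensemble procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of [8]: the base learner (a one-dimensional nearest-centroid classifier), the ranking heuristic (accuracy on the most recent chunk) and the names CentroidClassifier, build_ensemble and vote are all invented for the example.

```python
from collections import Counter, defaultdict


class CentroidClassifier:
    """Tiny stand-in base learner: predicts the label of the nearest class centroid."""

    def __init__(self, chunk):
        sums = defaultdict(lambda: [0.0, 0])
        for x, label in chunk:
            sums[label][0] += x
            sums[label][1] += 1
        self.centroids = {label: total / count for label, (total, count) in sums.items()}

    def predict(self, x):
        return min(self.centroids, key=lambda label: abs(x - self.centroids[label]))


def accuracy(clf, chunk):
    return sum(clf.predict(x) == label for x, label in chunk) / len(chunk)


def build_ensemble(stream, chunk_size, n):
    """Train one classifier per chunk and keep only the n best-ranked ones."""
    ensemble = []
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        ensemble.append(CentroidClassifier(chunk))
        # Assumed heuristic: rank by accuracy on the newest chunk, then prune.
        ensemble.sort(key=lambda clf: accuracy(clf, chunk), reverse=True)
        ensemble = ensemble[:n]
    return ensemble


def vote(ensemble, x):
    """Majority vote over the ensemble members for one test tuple."""
    return Counter(clf.predict(x) for clf in ensemble).most_common(1)[0][0]


# Hypothetical stream of (feature, label) records.
stream = [
    (0.10, "neg"), (0.90, "pos"), (0.20, "neg"), (0.80, "pos"),
    (0.15, "neg"), (0.85, "pos"), (0.30, "neg"), (0.70, "pos"),
    (0.05, "neg"), (0.95, "pos"), (0.25, "neg"), (0.75, "pos"),
]
ensemble = build_ensemble(stream, chunk_size=4, n=2)
print(len(ensemble), vote(ensemble, 0.1), vote(ensemble, 0.9))
```

Only the fitted classifiers are retained between chunks, so the raw records of earlier chunks can be discarded once their classifier has been built.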

Advantages:
i) As each data chunk is small compared to the entire data, the classifier construction cost per chunk is very low.
ii) As the classifier of a chunk is stored instead of all the data of that chunk, memory is saved.
iii) It can also adapt to the rigorous concept drift problem of big stream data.

f) Genetic Algorithms: Genetic algorithms (GAs) are a particular class of evolutionary algorithms that use techniques borrowed from biology: inheritance, mutation, selection and cross-over. GAs use binary strings to encode the features of an individual. The main advantage of GAs is that they are easily parallelizable, which makes them more suitable for big data classification. The MapReduce strategy is well applicable here.

Basic Genetic Algorithm
Step 1: Randomly select initial population.
Step 2: Repeat the following steps until terminated.
Step 3: Evaluate each individual's fitness.
Step 4: Prune population.
Step 5: Select pairs to mate from best-ranked individuals.
Step 6: Replenish population using selected pairs (apply cross-over, mutation etc.).
Step 7: Add or replace generated member to population.
Step 8: Check for termination criteria.
Step 9: End repeat.

g) Rough set approach: Rough set theory [10] is a powerful mathematical tool developed by Z. Pawlak in the early 1980s. It can be applied widely to extract knowledge from databases. It discovers hidden patterns in data by identifying partial and total dependencies in the data. It also works with null or missing values. Rough set methods work very well in dealing with uncertainties. Rough sets can be used together with other methods such as fuzzy sets, statistical methods and genetic algorithms to combine their benefits, or they can be map-reduced to gain scalability in real-time big data scenarios. A rough set depends on upper and lower approximations, which are explained below based on fig 6.
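As a concrete illustration of these approximations, the sketch below groups tuples into equivalence classes (tuples that are indistinguishable on the chosen attributes) and then collects the classes that lie entirely inside class C (lower approximation) and the classes that overlap C at all (upper approximation). The attribute names, the toy table and the function names are hypothetical and serve only as an example.

```python
from collections import defaultdict


def equivalence_classes(objects, attributes):
    """Group object ids that are indiscernible on the chosen attributes."""
    groups = defaultdict(set)
    for obj_id, row in objects.items():
        key = tuple(row[a] for a in attributes)
        groups[key].add(obj_id)
    return list(groups.values())


def approximations(objects, attributes, target):
    """Return (lower, upper, boundary) of the target class for the given attributes."""
    lower, upper = set(), set()
    for eq_class in equivalence_classes(objects, attributes):
        if eq_class <= target:      # every member certainly belongs to the class
            lower |= eq_class
        if eq_class & target:       # at least one member belongs to the class
            upper |= eq_class
    return lower, upper, upper - lower


# Hypothetical data set: object id -> attribute values.
objects = {
    1: {"income": "high", "student": "no"},
    2: {"income": "high", "student": "no"},
    3: {"income": "low", "student": "yes"},
    4: {"income": "low", "student": "yes"},
    5: {"income": "medium", "student": "no"},
}
class_c = {1, 3, 4}  # ids of the tuples labelled with class C

lower, upper, boundary = approximations(objects, ["income", "student"], class_c)
print(lower, upper, boundary)
```

Here tuples 1 and 2 share the same attribute values but only tuple 1 belongs to C, so both fall into the boundary region and the set is rough rather than crisp.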

1. Lower approximation: The lower approximation consists of all the data tuples that, based on the knowledge of the attributes, belong to class C without any ambiguity.
2. Upper approximation: The upper approximation consists of all the objects that possibly belong to class C, i.e., those that cannot be described as not belonging to class C based on the knowledge of the attributes.
3. Boundary region: The difference between the upper and lower approximations defines the boundary region of the rough set. The set is crisp if the boundary region is empty, and rough if the boundary region is nonempty.

Rough sets deal with the vagueness and uncertainty encountered in decision making. The equivalence classes are represented by rectangular regions in fig 6.

Fig 6: Rough set approximation of the tuples belonging to class C, using upper and lower approximation sets.

h) Swarm intelligence: Swarm intelligence is inspired by the swarm behavior [9] of insects and flocks of birds. A swarm is generally a group of several agents helping each other to achieve a common goal. The agents follow local rules to execute their actions, and with the help of the entire group they achieve their objective. Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) are the most popular examples of swarm intelligence. This technique can be implemented in real time where big data is distributed over a huge network.

i) Artificial neural networks: Artificial Neural Networks [9] are computational models that consist of a number of processing units that communicate with one another over a large network by sending signals. They are inspired by the human brain. In biological terms, a neuron collects signals from other neurons through dendrites. The most important feature of this approach is that it learns from examples, so explicit programming can be avoided. A sample neural network approach is depicted in fig 7 below.

Fig 7: A sample neural network approach.

j) Co-evolutionary programming [9]: It is based on the fact that the individuals of two populations evolve either by competing against each other or by co-operating with each other. This technique uses a fitness function that involves the relationship with other individuals. In the competitive approach, the fitness of an individual in one population is based entirely on the fitness of an individual in the other population, whereas in the cooperative approach, the fitness of an individual depends purely on the degree of cooperation with an individual in the other population. This technique can be applied in big data classification algorithms.

Conclusion:
Big data management is becoming crucial nowadays. Traditional data mining techniques need to be reviewed and modified to deal with big data problems. This paper focused on big data classification. Efficient big data classification strategies and techniques are discussed. Big data research should be encouraged in order to obtain new ideas for handling big data. In future, we would focus on big data clustering.

References
1. G. Noseworthy, Infographic: Managing the Big Flood of Big Data in Digital Marketing, 2012, http://analyzingmedia.com/2012/infographic-big-flood-of-big-data-in-digitalmarketing/.
2. H. Moed, The Evolution of Big Data as a Research and Scientific Topic: Overview of the Literature, 2012, ResearchTrends, http://www.researchtrends.com.
3. MIKE 2.0, Big Data Definition, http://mike2.openmethodology.org/wiki/bigdata Definition.

4. P. Zikopoulos, T. Deutsch, D. Deroos, Harness the Power of Big Data, 2012, http://www.ibmbigdatahub.com/blog/harness-power-big-data-book-excerpt.
5. Peng Zhang, Xingquan Zhu, Jianlong Tan and Li Guo, Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams, 10th IEEE International Conference on Data Mining, pp. 1175-1180, 2010.
6. Ahsanul Haque, Brandon Parker and Latifur Khan, Labeling Instances in Evolving Data Streams with MapReduce, Big Data Congress, IEEE, pp. 387-394, 2013.
7. Mukkamala, R.R., Hussain, A., Vatrapu, R., Fuzzy-Set Based Sentiment Analysis of Big Social Data, In Proc. of IEEE 18th International Enterprise Distributed Object Computing Conference, pp. 71-80, 2014.
8. Wenyu Zang, Peng Zhang, Chuan Zhou and Li Guo, Comparative study between incremental and ensemble learning on data streams: Case study, Journal of Big Data 2014, 1:5, Springer, 2014.
9. Neha Khan, Mohd Shahid Husain, Mohd Rizwan Beg, Big Data Classification using Evolutionary Techniques: A Survey, In Proc. of IEEE International Conference on Engineering and Technology (ICETECH), pp. 243-247, 2015.
10. Prachi Patil, Data Mining with Rough Set Using Map-Reduce, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 2, Issue 11, pp. 6980-6986, 2014.

Corresponding Author: G.Somasekhar*, Email: gidd.somasekhar2014@vit.ac.in