EFFICIENT TECHNIQUES TO DEAL WITH BIG DATA CLASSIFICATION PROBLEMS
G. Somasekhar1*, Dr. K. Karthikeyan2
1 Research Scholar/SCSE, VIT University, Vellore, Tamil Nadu, India.
2 Associate Professor/SAS, VIT University, Vellore, Tamil Nadu, India.
Email: gidd.somasekhar2014@vit.ac.in, k.karthikeyan@vit.ac.in
Received on 05-08-2015; Accepted on 25-08-2015
Abstract
Big data analytics is the process of examining large data sets containing a variety of data types, i.e., big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. It is an extension of data mining: many traditional data mining techniques can be adapted to big data either through small modifications or by combining more than one technique, and MapReduce is a strategy that can be applied in many big data applications. The main focus of this paper is big data classification. The techniques that can be used for big data classification are discussed; these techniques can be applied across big data scenarios.
Keywords: Classification, big data, analytics, MapReduce, ensemble learning, fuzzy set approach, incremental algorithm, semi-supervised learning.
Introduction
Big data management is becoming crucial nowadays because of the growth of social networking websites, powerful mobile devices, sensors, and cloud computing. According to IDC analysis, the global data volume is expected to grow 44-fold between 2009 and 2020, and it may grow beyond what existing technology and infrastructure can support. Big data has become a buzzword, driving revolutionary changes in data processing, data storage and data analytics. Since information is the basic need for any kind of development, society needs more big data techniques to extract useful information from big data, and tools such as Hadoop have evolved into the primary means for organizations to deal with big data.
Characteristics of Big Data and Applications of Big Data Analytics
Big data is characterized by the "4 Vs", which are regarded as its fundamental characteristics:
i) Volume: Real-time big data scenarios demand huge volumes of data, ranging from petabytes to exabytes and sometimes beyond.
ii) Variety: Collection of different types of data from different data sources.
iii) Velocity: The speed of data generation and data updating.
iv) Value: Extraction of valuable knowledge and support for decision making.
Some of the challenges of big data include limited main memory, data security, data recovery, data processing, and maintaining the balance between ethical values and big data management. Big data analytics has evolved as the discipline of extracting interesting patterns from big data to support the decision-making process. It is applied in many fields such as finance, medicine, bioinformatics, science, space research and the retail industry. Some applications of big data, together with the corresponding algorithms and computing methods, are listed in Table 1.
Table 1: Applications of big data, the algorithms and computing methods.
Big Data Classification: Problems
Classification is a data mining technique that assigns categorical labels or classes to the records of a large data set. When this data set has the characteristics of big data, the task is termed big data classification. Because big data presents several challenges to overcome (mentioned above in this section), traditional classification techniques may not be suitable for big data classification, and modifying conventional classification algorithms or applying a big data technique is inevitable to meet big data processing needs. In addition, big data classification algorithms need to be scalable and incremental, and they have to address the problems of big streaming data such as concept drift, infinite length, concept evolution and feature evolution. The following section briefly explains efficient big data classification techniques.
Techniques for Big Data Classification
a) Application of MapReduce to traditional data classification algorithms [6]: Both lazy learners and eager learners can be adapted to the MapReduce framework to obtain accurate classification results in less time. Lazy learners include the k-nearest neighbour classifier and case-based reasoning (CBR); eager learners include classification algorithms such as Bayesian classification, decision tree induction, rule-based classification, classification by backpropagation and so on. The MapReduce strategy and its application to a traditional data classification algorithm are depicted in Fig. 1 and Fig. 2, and a minimal code sketch of this counting pattern is given after subsection (b) below.
Fig. 1: Sample application of MapReduce as a shape counter.
Fig. 2: Application of MapReduce to a traditional data classification technique.
b) Semi-supervised learning: Semi-supervised learning [5] is a mixture of data classification (supervised learning) and data clustering (unsupervised learning), as depicted in Fig. 3. Building classifiers is difficult, labour-intensive, costly and time-consuming in real-time big data scenarios. It is often the case that only a small number of labeled samples are available to train classifiers, while a large number of unlabeled samples are available to build clusters from the big data. In such cases this classification technique is a good choice.
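To make the pattern of subsection (a) and Fig. 1 concrete, the following is a minimal in-memory sketch of the map, shuffle and reduce phases in plain Python. It only illustrates the counting pattern, not an actual Hadoop or MapReduce-framework job; the shape data and helper names are invented for the example.

```python
from collections import defaultdict
from itertools import chain

# Illustrative input: each "split" is a list of shapes, as in the Fig. 1 counter.
splits = [["circle", "square", "circle"],
          ["triangle", "circle", "square"]]

def map_phase(split):
    # Map: emit a (key, 1) pair for every shape in one input split.
    return [(shape, 1) for shape in split]

def shuffle(pairs):
    # Shuffle/sort: group all intermediate values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: aggregate the counts for one key.
    return key, sum(values)

intermediate = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'circle': 3, 'square': 2, 'triangle': 1}
```

For classification in the spirit of Fig. 2, the same pattern can be reused: for example, each mapper can emit (class label, 1) pairs for the training tuples in its partition that are nearest to a test tuple, and the reducer sums the votes per class.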
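For subsection (b), the following is a minimal self-training sketch, assuming scikit-learn is available and that the inputs are NumPy arrays; the base model, confidence threshold and number of rounds are illustrative assumptions and not the ensemble procedure of reference [5].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled, threshold=0.95, rounds=5):
    # Start from the few labeled samples; repeatedly pseudo-label the large
    # unlabeled pool with high-confidence predictions and retrain the model.
    X, y, pool = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X, y)
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Promote confidently predicted samples into the training set.
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, model.predict(pool[confident])])
        pool = pool[~confident]
    return model
```

The small labeled set trains the initial model, and high-confidence predictions on the large unlabeled pool are promoted to pseudo-labels in later rounds, which is the essence of mixing supervised and unsupervised information.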
Fig. 3: Generation of a mixed ensemble by semi-supervised learning.
c) Fuzzy set approach: In Fig. 4, the membership values of x in the different fuzzy sets do not have to sum to 1, and each x may be a member of two or more fuzzy sets (here x is the value of income).
Fig. 4: Graphical representation of fuzzy membership values for the fuzzy sets low_income, medium_income and high_income in a sample employee data set.
For example, m_medium_income($49K) = 0.15 and m_high_income($49K) = 0.96, so m_medium_income($49K) + m_high_income($49K) = 1.11 ≠ 1, where m(·) denotes the membership function. This approach is called the fuzzy set approach [7] and is based on fuzzy set theory. Fuzzy set theory, also known as possibility theory, is very useful for dealing with vague or inexact facts in big data applications. Fuzzy rules and fuzzy models can be derived from fuzzy sets, and fuzzy logic systems can be used for big data classification in numerous areas including market research, finance, health care and environmental engineering. A minimal code sketch of fuzzy membership is given below.
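As a concrete illustration of the fuzzy set approach, the following minimal sketch defines trapezoidal membership functions for the income example; the breakpoints are assumed values chosen for illustration and are not taken from Fig. 4, so the resulting degrees differ from the figure.

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership: rises from a to b, stays 1 from b to c, falls to d.
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Illustrative fuzzy sets over annual income in $1000s (breakpoints assumed).
membership = {
    "low_income":    lambda x: trapezoid(x, -1, 0, 20, 40),
    "medium_income": lambda x: trapezoid(x, 20, 40, 50, 70),
    "high_income":   lambda x: trapezoid(x, 45, 70, 200, 201),
}

income = 49  # $49K
degrees = {name: round(f(income), 2) for name, f in membership.items()}
print(degrees)                 # one income value belongs to several fuzzy sets
print(sum(degrees.values()))   # the membership degrees need not sum to 1
```

In a fuzzy rule-based classifier, such membership degrees feed fuzzy rules whose outputs are aggregated and defuzzified to produce the final class label.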
d) Incremental learning: Properties of incremental classification [8]:
i) The classifier is updated dynamically as new (test) data arrives.
ii) There is no need to store all training data in main memory.
iii) The model can be flexibly modified based on newly arrived training records.
iv) The classifier can adapt to the gradual concept drift problem of big streaming data.
Examples include the Very Fast Decision Tree (VFDT) and Concept-adapting Very Fast Decision Tree (CVFDT) algorithms. The traditional Hoeffding tree algorithm is modified in three ways to obtain the VFDT algorithm:
i) Aggressive breaking of near ties during attribute selection.
ii) Deactivating the least promising leaves.
iii) Dropping poor splitting attributes.
To handle the concept drift problem efficiently, VFDT is further extended into CVFDT. Both VFDT and CVFDT can deal with big stream data classification problems while improving speed, memory utilization and scalability. Other examples include incremental decision trees, incremental Bayesian classification and so on.
e) Ensemble learning: Incremental learning cannot remove old records from a classifier. This limitation led to a different learning method called ensemble learning [8]. It divides the large data stream into small data chunks and builds an independent classifier for each chunk. A set of the n best classifiers, selected by heuristic methods, is then retained, where n is the size of the ensemble, and majority voting over these classifiers is applied to obtain the label of a test tuple, as depicted in Fig. 5; a minimal code sketch and the main advantages of this approach follow the figure.
Fig. 5: Ensemble learning.
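The following is a minimal sketch of the chunk-based ensemble idea, assuming scikit-learn decision trees as the per-chunk classifiers and accuracy on the newest chunk as the heuristic for ranking ensemble members; it is only an illustration under those assumptions, not the exact procedure studied in reference [8].

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    """Keep the n best per-chunk classifiers; classify by majority vote."""

    def __init__(self, n=5):
        self.n = n
        self.members = []  # list of (score, classifier) pairs

    def update(self, X_chunk, y_chunk):
        # Build an independent classifier on the newly arrived chunk.
        clf = DecisionTreeClassifier(max_depth=5).fit(X_chunk, y_chunk)
        # Heuristic: score all members (and the new classifier) on the newest
        # chunk, then retain only the n best classifiers as the ensemble.
        scored = [(m.score(X_chunk, y_chunk), m) for _, m in self.members]
        scored.append((clf.score(X_chunk, y_chunk), clf))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        self.members = scored[:self.n]

    def predict(self, x):
        # Majority vote of the retained classifiers for one test tuple.
        votes = [m.predict(x.reshape(1, -1))[0] for _, m in self.members]
        return Counter(votes).most_common(1)[0][0]
```

In a streaming setting, update() is called whenever the labels of a chunk become available; classifiers trained on outdated concepts naturally drop out once they stop ranking among the top n members.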
Advantages:
i) Since each data chunk is small compared to the entire data set, the classifier construction cost per chunk is very low.
ii) Since only the classifier built from a chunk is stored, rather than all the chunks themselves, memory is saved.
iii) It can also adapt to severe concept drift in big stream data.
f) Genetic Algorithms: Genetic algorithms (GAs) are a particular class of evolutionary algorithms that borrow the inheritance, mutation, selection and crossover mechanisms of biology. GAs use binary strings to encode the features of an individual. Their main advantage is that they are easily parallelizable, which makes them well suited to big data classification; the MapReduce strategy is well applicable here.
Basic Genetic Algorithm
Step 1: Randomly select an initial population.
Step 2: Repeat the following steps until terminated.
Step 3: Evaluate each individual's fitness.
Step 4: Prune the population.
Step 5: Select pairs to mate from the best-ranked individuals.
Step 6: Replenish the population using the selected pairs (apply crossover, mutation, etc.).
Step 7: Add the generated members to the population or replace existing ones.
Step 8: Check the termination criteria.
Step 9: End repeat.
g) Rough set approach: Rough set theory [10] is a powerful mathematical tool developed by Z. Pawlak in the early 1980s. It can be widely applied to extract knowledge from databases, discovering hidden patterns by identifying partial and total dependencies in the data, and it also works in the presence of null or missing values. Rough set methods deal very well with uncertainty. Rough sets can be combined with other methods such as fuzzy sets, statistical methods and genetic algorithms to obtain mixed benefits, or they can be combined with MapReduce to gain scalability in real-time big data scenarios. A rough set is defined by upper and lower approximations, which are explained below with reference to Fig. 6.
1. Lower approximation: The lower approximation consists of all the data tuples that, based on the attribute values, certainly belong to class C, without any ambiguity.
2. Upper approximation: The upper approximation consists of the objects that probably belong to class C, i.e., that cannot be described as not belonging to class C on the basis of the attribute knowledge.
3. Boundary region: The difference between the upper and lower approximations defines the boundary region of the rough set. The set is crisp if the boundary region is empty, and rough if the boundary region is non-empty.
Rough sets deal with the vagueness and uncertainty emphasized in decision making. The equivalence classes are represented by the rectangular regions in Fig. 6.
Fig. 6: Rough set approximation of the tuples belonging to class C, using the upper and lower approximation sets.
h) Swarm intelligence: Swarm intelligence is inspired by the swarm behaviour [9] of insects, bird flocks and similar groups. A swarm is generally a group of agents helping each other to achieve a common goal: the agents follow local rules to execute their actions, and with the help of the entire group they achieve their objective. Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) are the most popular examples of swarm intelligence. This technique can be applied in real-time settings where big data is distributed over a huge network.
i) Artificial neural networks: Artificial neural networks [9] are computational models consisting of a number of processing units that communicate with one another over a large network by sending signals. They are inspired by the human brain: in biological terms, a neuron collects signals from other neurons through its dendrites. The most important feature of this approach is
that the network can learn from examples, so that explicit programming can largely be avoided. A sample neural network approach is depicted in Fig. 7.
Fig. 7: A sample neural network approach.
j) Co-evolutionary programming [9]: This technique is based on the idea that the individuals of two populations evolve either by competing against each other or by cooperating with each other, using a fitness function that involves their relationship with individuals of the other population. In the competitive approach, the fitness of an individual in one population is based entirely on the fitness of an individual in the other population, whereas in the cooperative approach the fitness of an individual depends purely on its degree of cooperation with the individual in the other population. This technique can be applied in big data classification algorithms.
Conclusion: Big data management is becoming crucial nowadays, and traditional data mining techniques need to be reviewed and modified to deal with big data problems. This paper focused on big data classification and discussed efficient big data classification strategies and techniques. Big data research should be encouraged to produce new ideas for handling big data. In future work, we will focus on big data clustering.
References
1. G. Noseworthy, Infographic: Managing the Big Flood of Big Data in Digital Marketing, 2012, http://analyzingmedia.com/2012/infographic-big-flood-of-big-data-in-digitalmarketing/.
2. H. Moed, The Evolution of Big Data as a Research and Scientific Topic: Overview of the Literature, Research Trends, 2012, http://www.researchtrends.com.
3. MIKE 2.0, Big Data Definition, http://mike2.openmethodology.org/wiki/Big_Data_Definition.
4. P. Zikopoulos, T. Deutsch, D. deRoos, Harness the Power of Big Data, 2012, http://www.ibmbigdatahub.com/blog/harness-power-big-data-book-excerpt.
5. Peng Zhang, Xingquan Zhu, Jianlong Tan and Li Guo, Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams, 10th IEEE International Conference on Data Mining, pp. 1175-1180, 2010.
6. Ahsanul Haque, Brandon Parker and Latifur Khan, Labeling Instances in Evolving Data Streams with MapReduce, IEEE International Congress on Big Data, pp. 387-394, 2013.
7. R. R. Mukkamala, A. Hussain and R. Vatrapu, Fuzzy-Set Based Sentiment Analysis of Big Social Data, in Proc. of the 18th IEEE International Enterprise Distributed Object Computing Conference, pp. 71-80, 2014.
8. Wenyu Zang, Peng Zhang, Chuan Zhou and Li Guo, Comparative study between incremental and ensemble learning on data streams: case study, Journal of Big Data, 1:5, Springer, 2014.
9. Neha Khan, Mohd Shahid Husain and Mohd Rizwan Beg, Big Data Classification using Evolutionary Techniques: A Survey, in Proc. of the IEEE International Conference on Engineering and Technology (ICETECH), pp. 243-247, 2015.
10. Prachi Patil, Data Mining with Rough Set Using Map-Reduce, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 2, Issue 11, pp. 6980-6986, 2014.
Corresponding Author: G. Somasekhar*, Email: gidd.somasekhar2014@vit.ac.in