Index Terms Big Data, Machine Learning, Supervised learning, unsupervised learning. Reinforcement learning.

Transcription

1 Volume 5, Issue 4, April 2015 ISSN: X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Special Issue on Impact of Technology on Skill Development Conference Held at IETE Amravati Center, Maharashtra, India Machine Learning in Big Data Analytics: An Overview Prof. Kanchan M. Tarwani, Prof. Saleha S. Saudagar, Prof. Harshal D. Misalkar Abstract Big data is much more than storage of and access to data. Big data Analytics plays an important role in making sense of the data and exploiting its value. But it s a significant challenge to learn and develop new types of machine learning algorithms. Scaling up big data to proper dimensionality is an challenge that can encounter in machine learning algorithms plus there are challenges of dealing with velocity, volume and many more for all types of machine learning algorithms. Here, in this paper, we are first exploring big data concept, bringing with an urgent need for advanced data acquisition, management, and analysis mechanisms, we have presented the concept of big data and highlighted the four phases of big data that are generating data, acquisition of data, storing this large data, and then analyzing data. The next phase of this paper, focuses on dealing with big data using machine learning(ml), and highlighted the three ML methods: supervised learning, unsupervised learning and reinforcement learning and its impact on big data. Index Terms Big Data, Machine Learning, Supervised learning, unsupervised learning. Reinforcement learning. I. INTRODUCTION The emerging big-data paradigm, owing to its broader impact, has profoundly transformed our society and will continue to attract diverse attentions from both technological experts and the public in general. It is obvious that we are living a data deluge era, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation. For instance, an IDC report [1] predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, representing a double growth every two years. The term of ``big-data'' was coined to capture the profound meaning of this data-explosion trend and indeed the data has been touted as the new oil, which is expected to transform our society. Machine learning can be defined as a Field of study that gives computers the ability to learn without being explicitly programmed. It s a science of algorithms says that, the algorithms learn from the dataset, identifying patterns or classifying trends for instance, and then automates output- whether that s sorting data into categories or making predictions on future outputs. Machine Learning is now quite a matured discipline and captures in general any activity that involves automated learning from data or experience. At the core of machine learning is the ability of a software or machine to improve the performance of certain tasks through being exposed to data and experiences. Machine learning model first learns the knowledge from the data it is exposed to and then applies this knowledge to deliver predictions about the new data which is previously unseen data [1] [7]. The quality of the predictions delivered by machine learning model depend on a number of factors: how well the relevant knowledge is represented by the module, how effectively complete and representative were the learning data, how easily we can forecast the problem in general. Most of the times, machine learning became a very good weapon at certain recognition, identification or categorization tasks like fingerprint, faces or voice recognition [1][4]. Likewise recent clustering algorithms became very good at automatically grouping people according to their profiles, identify market segments, forming communities, and even segment images or distinguish genes [1], [6]. The successful application of ML models to these problems was possible not only thanks to the learning model ingenuity but primarily thanks to the very careful data with the properties like transformative, identifiable, and generation of multidimensional that made the learning problem discriminative enough to make easy distinctions between the predicted targets based on the data [1], [2], [6]. Depending upon the depth of knowledge that is available for learning, machine learning models can be categorized into supervised, unsupervised and semisupervised learning algorithms [1], [3], [5], [7]. Big data is of two types: Structured data and Semi structured data. Structured data are numbers and words that can be easily categorized and analyzed. These data are generated by things like network sensors embedded in electronic devices, smart phones, and global positioning system (GPS) devices. Structured data also include things like account balances, sales figures, and transaction data. Unstructured data include more complex information such as reviews of customers and forums from commercial websites, photos and other multimedia, and comments on social networking sites. These data cannot easily be divided into categories or analyzed numerically. 2015, IJARCSSE All Rights Reserved Page 270

2 II. BIG DATA AND ITS IMPACT B. Big Data Phases A big-data system is complex, providing functions to deal with different phases in the digital data life cycle, ranging from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different applications [16], [17]. In this section, we focus on the chain for big data analytics. Specifically, we describe a big data chain that consists of four stages (generation, acquisition, storage, and processing). A. Layered Architecture for Big Data The big data system can be decomposed into a layered structure, as illustrated in Fig. 1. The layered structure is divisible into three layers, i.e., the infrastructure layer, the computing layer, and the application layer, from bottom to top. This layered view only provides a conceptual hierarchy to underscore the complexity of a big data system. The function of each layer is as follows. The infrastructure layer consists of a pool of ICT resources, which can be organized by cloud computing infrastructure and enabled by virtualization technology. These resources will be exposed to upper-layer systems in a _ne-grained manner with a special service-level agreement (SLA). Within this model, resources must be allocated to meet the big data demand while achieving resource efficiency by maximizing system utilization, energy awareness, operational simplification, etc. The computing layer encapsulates various data tools into a middleware layer that runs over raw ICT resources. In the context of big data, typical tools include data integration, data management, and the programming model. Data integration means acquiring data from disparate sources and integrating the dataset into a unified form with the necessary data pre-processing operations. Data management refers to mechanisms and tools that provide persistent data storage and highly efficient management, such as distributed _le systems and SQL or NoSQL data stores. The programming model implements abstraction application logic and facilitates the data analysis applications. Map Reduce [12], Dryad [13], Pregel [14], and Dremel [15] exemplify programming models. _ The application layer exploits the interface provided by the programming models to implement various data analysis functions, including querying, statistical analyses, clustering, and classification; then, it combines basic analytical methods to develop various field related applications. Fig. 1. Architecture for Big Data Fig. 2. Big Data Value Chain a. Data Generation Data Generation the first and most important phase of Big data chain. This section highlights The trends of big data generation that can be characterized by the data generation rate. Specifically, the data generation rate is increasing due to technological advancements. we roughly classify data generation patterns into three sequential stages: Stage I : The first stage began in the 1990s. As digital technology and database systems were widely adopted, many management systems in various organizations were storing large volumes of data, such as bank trading transactions, shopping mall records, and government sector archives. These datasets are structured and can be analyzed through database-based storage management systems. Stage II : The second stage began with the growing popularity of web systems. The Web 1.0 systems, characterized by web search engines and ecommerce businesses after the late 1990s, generated large amounts of semi-structured and/or unstructured data, including web pages and transaction logs. Since the early 2000s, many Web 2.0 applications created an abundance of user-generated content from online social networks, such as forums, online groups, blogs, social networking sites, and social media sites. Stage III : The third stage is triggered by the emergence of mobile devices, such as smart phones, tablets, sensors and sensor-based Internet-enabled devices. The mobile centric network has and will continue to create highly mobile, location-aware, person-centered, and context relevant data in the near future. With this classification, we can see that the data generation pattern is evolving rapidly, from passive recording in Stage I to active generation in Stage II and automatic production in Stage III. These three types of data constitute the primary sources of big data, of which the automatic production pattern will contribute the most in the near future. b. Data Acquisition As illustrated in the big data value chain, the task of the data acquisition phase is to aggregate information in a digital form for further storage and analysis. Intuitively, 2015, IJARCSSE All Rights Reserved Page 271

3 the acquisition process consists of three sub-steps, data instant the challenge came from all sides: accessing large collection, data transmission, and data pre-processing, as amounts of data, learning the models from it and carry out illustrated in Fig. 3. mass predictions all in a reasonable time. Data collection refers to the process of retrieving raw data from real-world objects. The process needs to be well Since big data processing requires decomposition, designed. Otherwise, inaccurate data collection would parallelism, modularity and/or recurrence, inflexible impact black-box type machine learning models failed in an the subsequent data analysis procedure and ultimately lead outset. In fact all machine learning algorithms with to computational complexity of O(n2) immediately become invalid results. intractable when faced with billions of data points. Data transmission refers to once we gather the raw data, we must transfer it into a data storage infrastructure, Depending on the depth of knowledge that is available for commonly in a data center, for subsequent processing. The learning, ML models can be categorized into supervised, transmission procedure can be divided into two stages, IP unsupervised and semi-supervised learning algorithms [1], backbone transmission and data center transmission. [3], [5], [7]. Fig. 3. Subtasks of Data Acquisition Data pre-processing is also of equal importance in data acquisition phase. Because of their diverse sources, the collected data sets may have different levels of quality in terms of noise, redundancy, consistency, etc. Transferring and storing raw data would have necessary costs. On the demand side, certain data analysis methods and applications might have strict requirements on data quality. As such, data preprocessing techniques that are designed to improve data quality should be in place in big data systems. c. Data Storage The data storage subsystem in a big data platform organizes the collected information in a convenient format for analysis and value extraction. For this purpose, the data storage subsystem should provide two sets of features: The storage infrastructure must accommodate information persistently and reliably. The data storage subsystem must provide a scalable access interface to query and analyze a vast quantity of data. This functional decomposition shows that the data storage subsystem can be divided into hardware infrastructure and data management. d. Data Analysis The last and most important stage of the big data chain is data analysis, the goal of which is to extract useful values, suggest conclusions and/or support decisionmaking. First, we discuss the purpose and classification metric of data analytics. Second, we review the application evolution for various data sources and summarize the six most relevant areas. Finally, we introduce several common methods that play fundamental roles in data analytics. III. MACHINE LEARNING FOR BIG DATA The core abilities of the data analytics and the machine learning to model, learn from, and predict data were very deeply affected by the emergence of big data. Almost in an A. Supervised Learning Supervised learning algorithms use labeled training examples. The input data and the target outputs are given explicitly for the model to learn the mapping or function between them. Once this is captured the model then uses the learned mapping and the new unseen input data to predict their outputs. Within supervised learning family we can further distinguish between classification models which focus on prediction of discrete (categorical) outputs or regression models which predict continues outputs. Among large number of models reported in the literature linear and nonlinear density based classifiers, decision trees, naïve Bayes, support vector machines, neural networks and nearest neighbor are the most frequently cited and applied in practical applications [1][5]. Decision trees despite their simplicity were particularly successful in various industrial applications due to their robustness, transparency and the portability of the model effectively resulting in a set of SQL queries. Despite their operational simplicity induction a decision tree is at best O(n log n) complex and still requires significant data reduction effort to handle very large data [1] [5]. Support vector machines (SVM) were also reported as very powerful in terms of achieved performance on the numerical data thanks to their explicit effort to maximise the margin of misclassification along the boundaries among the classes of data [1]- [5]. Traditional SVMs are quadratically complex O(n2), yet there are successful attempts to cleverly reduce their complexity to linear complexity [15]. We intend to expand on these developments. Neural networks (NN) are yet another example of very powerful, flexible and robust predictors for both classification and regression problems [1]- [5]. Its generic structure allows to learn virtually any function successfully with enough hidden layers of processing nodes and training data. Like for SVM, NNs are typically O(n3) or at best quadratic ally complex with the number of examples. Complex structures further expand computational demands of learning NN. There have been recent attempts to redesign NN into more scalable architectures [16], further work is required to make NNs fully scalable for big data processing. 2015, IJARCSSE All Rights Reserved Page 272

4 Fig. 4 captures a brief demonstration of a supervised TABLE I learning applied to the famous Iris species detection EXPONENTIAL GROWTH OF BIG DATA problem based on just 2-dimensional data of petal width 100 represnts Sr. No Terms the peak search content and petal length: 1 The Big Data Big Data Analytics 60 3 Data Analysis 55 4 Hadoop 45 5 Big Data Hadoop 45 6 GoopgleBig Data 30 7 IBM Big Data 20 8 Oracle Big Data 20 9 Big Data Cloud 20 This table shows the exponential growth of the big data google trends, along geographical distribution and the most popular related terms, source Fig.4 Demonstration of a supervised learning applied to the famous Iris species detection problem B. Unsupervised Learning Unsupervised learning typically referred to as clustering is concerned with recognizing natural grouping among the data i.e. separating similar from dissimilar data based on multidimensional data. Figure 2 visually depicts some examples of with k-means and mixture of Gaussians clustering methods. Many clustering methods from k-means through density based DBSCAN and hierarchical clustering models were reported for simple problems on small data sizes [6], [10] Traditional hierarchical clustering usually involves computation of the pair wise distances among the input data examples which normally involves quadratic ally complex computation step. Assigning data samples to different clusters based on the distance hierarchy poses yet another computational overhead. DBSCAN is one of the most cited clustering algorithms. It is based on the data density and uses the concept local reach ability among points to decide about cluster memberships. However, even if optimized with efficient indexing structure it is at best O(n log n) complex in the number of examples [6]. K-means clustering, on the other hand, is linearly complex in the number of examples and data dimensionality but assumes fixed number of algorithm iterations and fixed number of clusters. Due to its linear complexity it is however the most scalable clustering model to date which also evolved into a successful artificial neural network equivalent the Kohonen Nets [6]. Some recent research reported further improved scalability of k-means model when the average aggregation is replaced with the median [10]. We intend to expand on the promising scalable supervised and unsupervised learning models as well as develop the new models that flexibly manage the balance between the computational complexity of learning and the prediction performance through modularization and recurrence at different data granularity. C. Reinforcement learning Reinforcement learning lies somewhat in between supervised and unsupervised learning in a sense that learning is not done on the fully labeled or evaluated data but a hint in a form of local reward is provided to every actionable data. The objective of the reinforcement learning is to maximize the long term reward through exploration and learning the optimal action policy in response to the environmental data. Reinforcement learning found prime applications in robotics, agent technologies and many complex exploration systems requiring continuous response to the changing environment while maximizing strategic longer-term objectives IV. BIG DATA REVOLUTION The emergence of the notion of Big Data (BD) initiated the true revolution in the way we store, manage and process the data. Coming now in Exabyte per day big data and its popularity continue to grow exponentially as it can be seen from the Google Trend chart presented in the table 1. Big data are linked to the explosive growth in mobile devices and their ability to generate, collect, share and access massive amounts of textual, numeric, imagery and video data. Combined with a correlated expansion of internet resources, networked services and the cult of sharing on ever growing social networks what we are experiencing is a true data revolution. V. CONCLUSION The era of big data is upon us, bringing with it an urgent need for advanced data acquisition, management, and analysis mechanisms. In this paper, we have presented the concept of big data and highlighted the big data value chain, which covers the entire big data lifecycle. The big data value chain consists of four phases: data generation, data acquisition, data storage, and data analysis. The next part of this paper, focuses on dealing with big data using machine learning(ml), and highlighted the three ML methods: supervised learning, unsupervised learning and reinforcement learning and its impact on big data. 2015, IJARCSSE All Rights Reserved Page 273

5 REFERENCES AUTHORS Prof. Kanchan M. Tarwani Received MTech. in Information Technology from YCCE, Nagpur in Received B.E. degree in Computer Science and Engineering from PRMIT&R, Badnera, Sant Gadge baba Amravati university in [1]. T.M. Mitchell. Machine Learning. McGraw-Hill, New York, [2]. R.O. Duda, P.E. Hart and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, [3]. C.M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, [4]. S. Marsland. Machine Learning, An Algorithmic Perspective. Chapman and Hall / CRC Press, Boca Raton, [5]. A. Mohri, A. Rostamizadeh and A. Talwalker. Foundations of Machine Learning. The MIT Press, Cambridge, [6]. C.C. Aggarwal and S.K. Reddy. Data Clustering, Algorithms and Applications. Chapman and Hall / CRC, Boca Raton, [7]. O. Chapelle, B. Cholkopf, A. Zien. Semi- Supervised Learning (Adaptive Computation and Machine Learning Series). MIT Press, Cambridge, [8]. J. Liebowitz. Big Data and Business Analytics. CRC Press, Boca Raton, [9]. V. Mayer-Schonberger and K. Cukier. Big Data: A Revolution That Will Transform How We Live, Work and Think. Houghton Mifflin Harcourt Publishing, New York, [10]. YC Kwon, D. Nunley, J.P. Gardner, M. Balazinska, B. Howe, S. Loebman. Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster. Scientific and Statistical Database Management 6187: , [11] J. Gantz and D. Reinsel, ``The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east,'' in Proc. IDC iview, IDC Anal. Future, [12] J. Dean and S. Ghemawat, ``Mapreduce: Simpli_ed data processing on large clusters,'' Commun. ACM, vol. 51, no. 1, pp. 107_113, [13] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, ``Dryad: Distributed data-parallel programs from sequential building blocks,'' in Proc. 2nd ACM SIGOPS/EuroSys Eur. Conf. Comput. Syst., Jun. 2007, pp. 59_72. [14] G. Malewicz et al., ``Pregel: A system for largescale graph processing,'' in Proc. ACM SIGMOD Int. Conf. Manag. Data, Jun. 2010, pp. 135_ S. Melnik et al., ``Dremel: Interactive analysis of web-scale datasets,'' Proc. VLDB Endowment, vol. 3, nos. 1_2, pp. 330_339, [16 ] E. B. S. D. D. Agrawal et al., ``Challenges and opportunities with big data_a community white paper developed by leading researchers across the united states,'' The Computing Research Association, CRA White Paper, Feb [17] D. Fisher, R. DeLine, M. Czerwinski, and S. Drucker, ``Interactions with big data analytics,'' Interactions, vol. 19, no. 3, pp. 50_59, May Prof. Saleha S. Saudagar Received MTech. in Computer Science and Engineering from SBITM Betul University, Madhya Pradesh in Received B.E. degree in Computer Science and Engineering from Government College of Engineering, Amravati in Prof. Harshal D. Misalkar Received M.E. in Information Technology from PRMIT&R, Badnera, Sant Gadge baba Amravati university in 2014.Received B.E. in Information Technology from PRMIT&R Badnera, Sant Gadge baba Amravati university in , IJARCSSE All Rights Reserved Page 274