Novel Framework for Distributed Data Stream Mining in Big data Analytics Using Time Sensitive Sliding Window


JOURNAL OF COMPUTER SCIENCE AND SOFTWARE APPLICATION (In Press)

V. Valli Mayil*
Director/MCA, Vivekanandha Institute of Information and Management Studies, Tiruchengode
*Corresponding author:

Abstract: In the digital world, large volumes of data are continuously generated by sensors, social media sites, videos, purchase transaction records, and cell phone GPS signals. Such data are called big data. The term describes collections of data sets so large and complex, both structured and unstructured, that they are difficult to process with on-hand data management tools or traditional data processing applications. Because of this complexity and volume, static databases and traditional mining procedures are unsuitable for big data analytics, and predicting patterns in such a dynamic environment is a challenging task. This paper proposes a novel framework for mining frequent patterns in a real-time dynamic environment based on a time-sensitive sliding window. The framework performs distributed mining that predicts frequent patterns from continuous data streams over tilted time windows: a distributed file system stores the continuous stream, and the tilted time window model holds parts of it. A data distribution model assigns data windows to different commodity processing nodes, so that the frequent pattern mining procedure runs on separate nodes simultaneously. The proposed framework uses the power of Hadoop to mine frequent itemsets in a distributed environment.

Keywords: Data Stream; Tilted Time Window; Big Data; Frequent Pattern; Hadoop; MapReduce
1. INTRODUCTION

Nowadays, tremendous volumes of data streams are generated by real-time surveillance systems, communication networks, Internet traffic, online transactions, electric power grids, and so on. Different types of structured and unstructured data flow continuously at high speed with varying rates. Every big data source has different characteristics, including the frequency, volume, velocity, type, and veracity of the data. Choosing an architecture and building an appropriate big data solution is challenging because it must account for the volume, variety, and velocity of the data. Volume refers to the amount of data generated; variety refers to the mix of structured and unstructured types. Structured data can be stored in spreadsheets and databases, whereas unstructured data takes the form of messages, images, videos, PDFs, and audio files. This variety of unstructured data creates problems in storing, mining, and analyzing data. Big data is also characterized by another property called

velocity, which concerns the rate at which data flows from sources such as business processes, machines, networks, and human interaction with social media sites, mobile devices, and similar channels.

The uniqueness of big data can be described as follows:
1. Massive: volumes range from terabytes to yottabytes
2. Temporally ordered
3. Fast changing and continuous: data arrive continuously at a high rate
4. Expiration: data can be read only once
5. Potentially infinite: the total amount of data is unbounded
6. Mixed types: unstructured, structured, and semi-structured data

These factors lead to the following requirements in our proposed distributed framework.

Single scan: Big data volumes start at terabytes or even petabytes. This massive volume must be monitored continuously, at every fraction of time, by the sensing or processing system. Repeated or multiple scans are not possible at the increasing rate of big data, so our proposed system must operate with a single scan of the big data storage.

Streaming data model: This is an analytic computing platform focused on speed. Our proposed system requires it because big data applications demand that a continuous stream of unstructured data be processed: data are continuously analyzed and transformed in memory before being stored on disk. Stream processing is achieved by splitting the streams into time windows, which can then be processed across a cluster of servers.

Sliding window model: In this model, part of the big data stream is analyzed to produce an approximate answer; only the recent data within the sliding window are evaluated.

Distributed file system: Because the amount of data is unlimited while system resources such as memory space and CPU power are limited, a mining algorithm distributes the data into multiple splits and assigns the splits to multiple processing nodes for easier manipulation.
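The time-window splitting described above can be sketched in a few lines. The following is a minimal single-pass illustration, not the paper's implementation: the function name `split_into_time_windows` and the toy stream of (timestamp, transaction) pairs are assumptions made for the example.

```python
def split_into_time_windows(stream, slot_seconds):
    """Group a time-ordered stream of (timestamp, transaction) pairs
    into consecutive windows w1, w2, ... of slot_seconds each.
    The stream is read exactly once (single scan)."""
    windows = []
    current, slot_end = [], None
    for ts, txn in stream:
        if slot_end is None:
            slot_end = ts + slot_seconds
        while ts >= slot_end:          # close any finished slots
            windows.append(current)
            current, slot_end = [], slot_end + slot_seconds
        current.append(txn)
    if current:                        # close the last, partial slot
        windows.append(current)
    return windows

stream = [(0, ["a", "b"]), (1, ["b"]), (5, ["a", "c"]), (9, ["c"]), (11, ["a"])]
print(split_into_time_windows(stream, 5))
# → [[['a', 'b'], ['b']], [['a', 'c'], ['c']], [['a']]]
```

Each returned window corresponds to one time slot ti and can then be shipped off for independent processing, which is the point of the sliding window requirement above.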
Hadoop and MapReduce: Groups of streaming data can be assigned to clusters of commodity systems, and parallel processing can be adopted to mine patterns in each frame. MapReduce then shuffles and merges the partial results into the predicted patterns.

Parallel processing: The allotted slots of massive data are to be processed simultaneously.

Approximation: A method for providing approximate answers with accuracy guarantees is required.

Adjustability: Because the amount of data is unlimited, a mechanism that adapts to the available resources is needed.

Unlike a traditional database system, a big data system produces data streams continuously, from different sources, at high speed. The data are time-varying, multiple, unbounded, and rapidly generated in a dynamic environment. These characteristics create inherent challenges for retrieving, storing, and manipulating the data. First, because the data arrive at high speed, there is not enough time to rescan the whole database or perform multiple scans. Second, there is not enough space to store all the stream data for online processing, so the mining method must adapt to the changing data distribution. Third, the mining algorithm should be faster than the data arrival rate; otherwise an approximation algorithm must be engaged, which reduces the accuracy of the results. Fourth, because the data distribution changes, the analysis results need proper updating: the mining algorithm should be an incremental process to keep up with the high update rate.

2. RELATED WORK

2.1 Stream Data Processing Models

Given the characteristics and challenges of big data described in the previous section, it is impractical to scan an entire data stream more than once. Effective stream processing requires new data structures, algorithms, and techniques. In this section we discuss some common data structures and techniques for retrieving the most recent portion of a data stream for analysis.

Zhu and Shasha [1] discussed three data processing models: landmark, damped, and sliding window. The landmark model mines frequent itemsets over the entire history of the stream, from a specified time point called the landmark to the present. This model does not reflect the most recent updates in the stream, so it is not applicable to users, such as stock monitors, who need current, real-time information. The damped model (also called the time-fading model) mines frequent itemsets in the stream using weights assigned to transactions: older transactions carry lower weight than recent ones. It suits applications in which old transactions still influence the mining procedure, but less strongly than new ones.

Sliding window: The sliding window model mines patterns within a window of the data; only the data inside the window are stored and processed. In [2] the authors proposed an algorithm to mine the frequent itemsets of a data stream within the current sliding window. The window size may be chosen according to the application and the system resources. Imposing sliding windows on data streams is a natural method of approximation with several attractive properties.
It is well defined and easily understood: the semantics of the approximation are clear, so users of the system can be confident that they understand what is given up in producing the approximate answer. It is deterministic, so there is no danger that unfortunate random choices will produce a bad approximation. Most importantly, it emphasizes recent data, which in the majority of real-world applications is more important and relevant than old data: if one is trying to make real-time sense of network traffic patterns, phone call or transaction records, or scientific sensor data, then insights based on the recent past will generally be more informative and useful than insights based on stale data. Yang and Mao [3] discussed a self-adaptive sliding window model that learns the sliding window control parameter needed when patterns are mined from data streams.

Sampling, load shedding, and synopsis data models: These models choose a subset of the stream to process and provide approximate results. Sampling refers to the probabilistic choice of the data items to process; its drawback is that the data set size is unknown. Load shedding [4] refers to dropping a sequence of data stream elements. Creating a synopsis [5] refers to applying summarization techniques capable of condensing the incoming stream for future analysis.

Histogram: A histogram is a data structure that approximates the frequency distribution of element values in a data stream by partitioning the data into a set of contiguous buckets. Hash-based methods have been proposed to mine frequent items, and they have been extended with the lossy counting model.

Multi-resolution methods: A common way to deal with a large amount of data is to employ divide-and-conquer strategies such as multi-resolution data structures.
These methods offer the ability to understand a data stream at multiple levels of detail. Clustering is employed to store the multiple levels of stream data; a hierarchical clustering structure such as the CF tree in BIRCH forms the hierarchy of clusters. All the above approaches provide approximate answers over long-term data and adjust their storage requirements to the available space, thereby satisfying the two requirements of approximation and adjustability.
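The lossy counting model mentioned above is easy to illustrate. The following is a minimal sketch of the classic one-pass algorithm, not code from the surveyed papers: it keeps at most on the order of 1/epsilon counters and undercounts any item by at most epsilon * N, where N is the number of items seen.

```python
def lossy_count(stream, epsilon):
    """One-pass approximate frequency counting (lossy counting).
    counts[item] underestimates the true count by at most epsilon * N."""
    width = int(1 / epsilon)          # bucket width
    counts, deltas = {}, {}           # running count and max possible error
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item], deltas[item] = 1, bucket - 1
        if n % width == 0:            # end of bucket: prune rare items
            rare = [i for i in counts if counts[i] + deltas[i] <= bucket]
            for i in rare:
                del counts[i], deltas[i]
            bucket += 1
    return counts

stream = ["a"] * 6 + ["b"] * 3 + ["c"]
print(lossy_count(stream, 0.2))
# → {'a': 6, 'b': 3}   ('c' was pruned as rare)
```

This satisfies the approximation and adjustability requirements above: memory use is bounded by the choice of epsilon rather than by the stream length.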

2.2 Mining Techniques in Distributed Environments

Mining data streams is concerned with extracting knowledge, represented as models and patterns, from unending streams of information; it requires special architectures and methods. The authors of [6] discussed SAMOA, a platform for online mining in a cluster/cloud environment. It features a pluggable architecture that allows it to run on several distributed stream processing engines, and it includes algorithms for the most common machine learning tasks, such as classification and clustering. The FP-stream approach [7] mines frequent itemsets under the time-fading model: frequent items are stored in a compact tree representation called the FP-tree, the continuous stream is divided under a time window model, and each window maintains its frequent patterns and their counts, with an FP-tree built per window. It is suitable for mining only the recent data. The authors of [5] discussed methods and issues in data stream management, covering mining techniques for data streams such as association, clustering, and classification. The author of [8] proposed a general method for scaling up machine learning algorithms, called very fast machine learning. Mining frequent itemsets in dynamic databases [9, 10] has been studied over the last decade: the frequent items and support counts are derived from the original database and updated as new transactions are added. Recent research addresses mining frequent items over data streams using the sliding window model [10, 11]. In [12] the authors proposed a distributed algorithm that imposes low communication overhead when mining distributed datasets: a minimum support threshold and frequency count are supplied interactively for all window transactions, and the results are merged to find the frequent patterns of the dataset in the window W.
To mine frequent items from a data stream, an appropriate single-scan frequent mining algorithm is applied. The authors of [13] proposed new methods for mining frequent itemsets in parallel on the MapReduce framework; the Eclat method distributes the search space as evenly as possible among mappers. In [14] the authors presented a MapReduce-based algorithm that addresses frequent itemset mining with dynamic workload management through block-based partitioning. In [15] the authors build on the distributed file system HDFS and describe NIM (Network Intelligence Miner), a scalable and elastic streaming solution that analyzes traffic patterns in real time and provides information for real-time decision making.

3. PROBLEM DEFINITION AND ANALYSIS

The objective of our work is to propose a distributed framework for analyzing big data streams and predicting the complete set of frequent patterns they contain. Our model combines a time-sensitive window with distributed, parallel processing. Distributed and parallel processing relies heavily on data partitioning: a large data set is broken down into multiple pieces that independent processors can handle. Consider a continuous stream built on a time-sensitive sliding window W, divided into splits W = (w1, w2, w3, ..., wn) over recent time slots (t1, t2, t3, ..., tn). The split streams (w1, w2, ..., wn) are handed to the Hadoop Distributed File System (HDFS) and the MapReduce paradigm to obtain frequent patterns. Hadoop is an open-source distributed framework designed around Google's MapReduce programming model. It is capable of analyzing large amounts of data and supports a write-once, read-many access model. Hadoop has its own file system, the Hadoop Distributed File System (HDFS), which runs on commodity hardware with high fault tolerance.
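The partitioning step described above, assigning the window splits w1..wn to processing nodes, can be sketched as follows. This is a deliberately simplified round-robin assignment in a single process; the function and node names are hypothetical, and real HDFS placement additionally handles block sizing, rack awareness, and replication.

```python
def distribute_windows(windows, num_nodes):
    """Assign window splits w1..wn to processing nodes round-robin,
    a toy stand-in for HDFS block placement across commodity nodes."""
    nodes = {f"node{i}": [] for i in range(num_nodes)}
    for idx, window in enumerate(windows):
        nodes[f"node{idx % num_nodes}"].append(window)
    return nodes

# four window splits distributed over two nodes
windows = [["t1", "t2"], ["t3"], ["t4", "t5"], ["t6"]]
print(distribute_windows(windows, 2))
# → {'node0': [['t1', 't2'], ['t4', 't5']], 'node1': [['t3'], ['t6']]}
```

Each node can then run the frequent pattern mining procedure on its local splits independently, which is the basis of the parallelism claimed for the framework.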
Data replication is one of the important features of HDFS,

which ensures data availability and automatic re-execution in the face of multiple node failures. In this paper we propose an algorithm that uses the power of Hadoop to mine patterns in distributed environments.

Consider n windows W = {w1, w2, ..., wn}, each holding m transactions and each created in a time slot t0, t1, t2, ..., tn. Frequent patterns are to be obtained over all window transactions. We use block partitioning to distribute the datasets among the processing nodes: the Hadoop HDFS system allocates the window transactions to M clusters of nodes in a server. Let the size of partition Ti be Di. Each partition Ti is divided into bi blocks {t1, t2, ..., tbi}. The size of a block ti defaults to 64 MB, or is set according to the available memory on processing node Ni, the number of items, the average transaction width, and the support threshold of the dataset. To minimize computation cost, the window transactions are distributed to multiple commodity nodes. Our framework adopts a distributed, parallel, and incrementally updated environment.

4. PROPOSED SYSTEM COMPONENTS

We propose a framework based on distributed MapReduce technology for a dynamic environment such as big data analytics, built on a tilted time frame model with a distributed file system. The following components are needed.

4.1 Time Sensitive Sliding Window

Sources of stream data generate unstructured or semi-structured data rapidly. We divide the data stream using a time-sensitive sliding window model. The sliding window model of computation is motivated by the assumption that recent data are more useful and pertinent than older data.
The entire stream in a window W of size m is divided into n splits (w1, w2, w3, ..., wn), each generated in a time slot (t1, t2, ..., tn).

4.2 Hadoop Distributed File System (HDFS)

HDFS allocates the splits (subsets) of the data stream to data nodes in a commodity cluster of servers. Data in a Hadoop cluster are broken down into smaller pieces, called blocks, and distributed throughout the cluster. In this way the map and reduce functions can execute on small subsets of the larger data set, which provides the scalability needed for big data processing. HDFS enables the distributed processing of large data sets across clusters of commodity servers and is designed to scale from a single server to thousands of machines with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop enables a computing solution that is:

Scalable: New nodes can be added as needed, without changing data formats, how data are loaded, how jobs are written, or the applications on top.

Cost effective: Hadoop brings massively parallel computing to commodity servers, yielding a sizeable decrease in the cost per terabyte of storage and making it affordable to model all your data.

Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of

sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any single system can provide.

Fault tolerant: When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

4.3 Hadoop MapReduce Framework

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. Processing can be done on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of data locality, processing data on or near the storage assets in order to reduce the distance over which they must be transmitted.

Map step: Each worker node applies the map() function to its local data and writes the output to temporary storage. A master node ensures that only one of the redundant copies of the input data is processed.

Shuffle step: Worker nodes redistribute data based on the output keys produced by the map() function, such that all data belonging to one key end up on the same worker node.

Reduce step: Worker nodes process each group of output data, per key, in parallel. A set of reducers can perform the reduction phase, provided that all outputs of the map operation that share the same key are presented to the same reducer at the same time, or that the reduction function is associative. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, its work can be rescheduled.

4.4 Pattern Tree Structure

Because memory is bounded while large amounts of stream data arrive continuously, an efficient and compact data structure is needed to store and update the collected information. For each window, a pattern tree is maintained to store the frequent patterns and their counts.
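The map, shuffle, and reduce steps described above can be simulated in a single process to show how they combine for frequent item-pair mining. This is a minimal sketch under assumed names (`map_phase`, `shuffle`, `reduce_phase`), not the framework's actual Hadoop job; in a real cluster each phase would run on separate nodes.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transactions):
    """Map: emit (item_pair, 1) for every pair in each transaction."""
    for txn in transactions:
        for pair in combinations(sorted(txn), 2):
            yield pair, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, minsup):
    """Reduce: sum the counts per pair and keep pairs meeting minsup."""
    return {k: sum(v) for k, v in groups.items() if sum(v) >= minsup}

txns = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "b", "d"]]
print(reduce_phase(shuffle(map_phase(txns)), 2))
# → {('a', 'b'): 3, ('b', 'c'): 2}
```

Sorting each transaction before pairing keeps keys canonical, so ("a", "b") and ("b", "a") land on the same reducer, which is exactly the grouping guarantee the shuffle step provides.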
Each node in the tree represents a pattern, and its frequency is recorded at the node. Figure 1 shows the structure of the pattern tree.

5. SYSTEM FRAMEWORK

The proposed framework employs distributed and parallel processing on the Hadoop architecture in order to obtain frequent patterns from fixed-size data streams. Figure 2 shows the proposed layout of the system. HDFS allocates the data stream splits into default-size blocks on the commodity cluster, and this allocation feeds the map function. Transactions and block replicas are tracked as metadata on a name node, while the data nodes in the server perform frequent pattern mining on their blocks of stream data. A MapReduce computation has two phases, map and reduce. The input to the computation is a data set of key/value pairs: the input pairs are partitioned into chunks and map() tasks run in parallel; after all map() tasks complete, the emitted values are consolidated for each unique emitted key; the space of output map keys is partitioned and reduce() runs in parallel; finally, the reducers generate the output and store it in HDFS. Tasks in each phase execute in a fault-tolerant manner, and having many map and reduce tasks enables good load

Figure 1. Pattern Tree.

balancing. The framework provides an interface that is independent of the backend technology. All nodes in the server cluster perform mining operations, producing results as key/value pairs. The output of the map phase is passed to the shuffling and sorting phase, where the results are clustered: each reducer receives an item pair as key and, as value, the list of all transactions in which that item pair occurs. For each such key/value pair the reducer sums the transactions containing the item pair, compares the sum with the minimum support, and emits the item pair as key with a null value. In this way we obtain frequent item pairs with fewer database scans, at the cost of more in-memory computation to generate the combinations.

Input:
  W - data stream in a specified window
  F - set of frequent patterns
  minsup - minimum support threshold

Steps:
1. Split the window W into sliding window frames (w1, w2, w3, ..., wn)
2. Assign block size = 64 MB in the Hadoop framework
3. Initiate the name service and metadata information META for the stream W
4. Load HDFS to allocate the streams {w1, w2, w3, ..., wn} to the data nodes (d1, d2, d3, ..., dn) of the server
5. Ensure replication of the streams
6. Implement frequent pattern mining using the pattern tree with the minsup parameter
7. Map the function Freqmining() to all data nodes (d1, d2, d3, ..., dn)
8. Shuffle and sort the resulting (key, value) pairs
9. Group by key: (k1, v1), (k2, v2), ..., (kx, vx)

Figure 2. Proposed System Framework.

Pseudo code: distributed frequent pattern mining from a data stream (continued).
10. Reduce functions r1, r2, ..., rx // as many reduce functions as there are keys
11. Frequent patterns F = (f1, f2, ...)

6. CONCLUSION

A big data stream is continuous data generated from a real-time environment that must be processed and analyzed immediately for decision making. Traditional systems, which rely on repeated scans and limited processing power, are not sufficient for such operations. In this paper we suggested adopting a data distribution technology such as Hadoop to manage the continuous flow of data and to process it simultaneously. The processing finds frequent patterns in the data splits, which are maintained in a tilted time frame window, and a frequent pattern mining algorithm on Hadoop is employed to find the frequent itemsets.

References

[1] Y. Zhu and D. Shasha, "StatStream: Statistical monitoring of thousands of data streams in real time," in Proceedings of the 28th International Conference on Very Large Data Bases, VLDB Endowment.
[2] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou, "Dynamically maintaining frequent items over a data stream," in Proceedings of the Twelfth International Conference on Information and Knowledge Management, ACM.
[3] Y. Yang and G. Mao, "A self-adaptive sliding window technique for mining data streams," in Intelligence Computation and Evolutionary Computation, Springer.
[4] N. Tatbul, U. Çetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker, "Load shedding on data streams," in Proceedings of the Workshop on Management and Processing of Data Streams (MPDS 03), San Diego, CA, USA.
[5] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: a review," ACM SIGMOD Record, vol. 34, no. 2.
[6] G. De Francisci Morales, "SAMOA: A platform for mining big data streams," in Proceedings of the 22nd International Conference on World Wide Web Companion, International World Wide Web Conferences Steering Committee.
[7] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, "Mining frequent patterns in data streams at multiple time granularities," Next Generation Data Mining, vol. 212.
[8] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM.
[9] D. W. Cheung, J. Han, V. T. Ng, and C. Wong, "Maintenance of discovered association rules in large databases: An incremental updating technique," in Proceedings of the Twelfth International Conference on Data Engineering, IEEE.
[10] C.-H. Lee, C.-R. Lin, and M.-S. Chen, "Sliding-window filtering: an efficient algorithm for incremental mining," in Proceedings of the Tenth International Conference on Information and Knowledge Management, ACM.
[11] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, "Aurora: a new model and architecture for data stream management," The VLDB Journal, vol. 12, no. 2.
[12] L. Zeng, L. Li, L. Duan, K. Lu, Z. Shi, M. Wang, W. Wu, and P. Luo, "Distributed data mining: a survey," Information Technology and Management, vol. 13, no. 4.
[13] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng, "Balanced parallel FP-growth with MapReduce," in 2010 IEEE Youth Conference on Information Computing and Telecommunications (YC-ICT), IEEE.
[14] L. Dhamdhere Jyoti and B. Deshpande Kiran, "A novel methodology of frequent itemset mining on Hadoop."
[15] L. Pan, J. Qian, C. He, W. Fan, C. He, and F. Yang, "NIM: Scalable distributed stream process system on mobile network data," in 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW), IEEE.


More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Are You Ready for Big Data?

Are You Ready for Big Data? Are You Ready for Big Data? Jim Gallo National Director, Business Analytics April 10, 2013 Agenda What is Big Data? How do you leverage Big Data in your company? How do you prepare for a Big Data initiative?

More information

Advances in Natural and Applied Sciences

Advances in Natural and Applied Sciences AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Efficient Analysis of Big Data Using Map Reduce Framework

Efficient Analysis of Big Data Using Map Reduce Framework Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

A SURVEY ON MAPREDUCE IN CLOUD COMPUTING

A SURVEY ON MAPREDUCE IN CLOUD COMPUTING A SURVEY ON MAPREDUCE IN CLOUD COMPUTING Dr.M.Newlin Rajkumar 1, S.Balachandar 2, Dr.V.Venkatesakumar 3, T.Mahadevan 4 1 Asst. Prof, Dept. of CSE,Anna University Regional Centre, Coimbatore, newlin_rajkumar@yahoo.co.in

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Mining Interesting Medical Knowledge from Big Data

Mining Interesting Medical Knowledge from Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

More information

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors

Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Multicore Processors Map-Parallel Scheduling (mps) using Hadoop environment for job scheduler and time span for Sudarsanam P Abstract G. Singaravel Parallel computing is an base mechanism for data process with scheduling task,

More information

ImprovedApproachestoHandleBigdatathroughHadoop

ImprovedApproachestoHandleBigdatathroughHadoop Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 14 Issue 9 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Improving Apriori Algorithm to get better performance with Cloud Computing

Improving Apriori Algorithm to get better performance with Cloud Computing Improving Apriori Algorithm to get better performance with Cloud Computing Zeba Qureshi 1 ; Sanjay Bansal 2 Affiliation: A.I.T.R, RGPV, India 1, A.I.T.R, RGPV, India 2 ABSTRACT Cloud computing has become

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Transforming the Telecoms Business using Big Data and Analytics

Transforming the Telecoms Business using Big Data and Analytics Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe

More information

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction: ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6 International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

More information

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION

A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION Sumitha VS 1, Shilpa V 2 1 M.E. Final Year, Department of Computer Science Engineering (IT), UVCE, Bangalore, gvsumitha@gmail.com

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop Cluster Applications

Hadoop Cluster Applications Hadoop Overview Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Networking in the Hadoop Cluster

Networking in the Hadoop Cluster Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Big Data: Study in Structured and Unstructured Data

Big Data: Study in Structured and Unstructured Data Big Data: Study in Structured and Unstructured Data Motashim Rasool 1, Wasim Khan 2 mail2motashim@gmail.com, khanwasim051@gmail.com Abstract With the overlay of digital world, Information is available

More information

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

REVIEW PAPER ON BIG DATA USING HADOOP

REVIEW PAPER ON BIG DATA USING HADOOP International Journal of Computer Engineering & Technology (IJCET) Volume 6, Issue 12, Dec 2015, pp. 65-71, Article ID: IJCET_06_12_008 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=6&itype=12

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data : High-throughput and Scalable Storage Technology for Streaming Data Munenori Maeda Toshihiro Ozawa Real-time analytical processing (RTAP) of vast amounts of time-series data from sensors, server logs,

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

A survey of big data architectures for handling massive data

A survey of big data architectures for handling massive data CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - jordydomingos@gmail.com Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated

More information