A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains

Similar documents
IncSpan: Incremental Mining of Sequential Patterns in Large Database

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Analysis of Data Mining Algorithms

Notation: - active node - inactive node

SPMF: a Java Open-Source Pattern Mining Library

Mining various patterns in sequential data in an SQL-like manner *

Mining the Most Interesting Web Access Associations

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

Discovering Sequential Rental Patterns by Fleet Tracking

A Survey on Association Rule Mining in Market Basket Analysis

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

Directed Graph based Distributed Sequential Pattern Mining Using Hadoop Map Reduce

Mining Association Rules: A Database Perspective

DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE

Performance Analysis of Sequential Pattern Mining Algorithms on Large Dense Datasets

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

MINING CLICKSTREAM-BASED DATA CUBES

A Hybrid Data Mining Approach for Analysis of Patient Behaviors in RFID Environments

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL

Binary Coded Web Access Pattern Tree in Education Domain

Issues for On-Line Analytical Mining of Data Warehouses. Jiawei Han, Sonny H.S. Chee and Jenny Y. Chiang

Fast Discovery of Sequential Patterns through Memory Indexing and Database Partitioning

An Effective System for Mining Web Log

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Classification and Prediction

KNOWLEDGE DISCOVERY and SAMPLING TECHNIQUES with DATA MINING for IDENTIFYING TRENDS in DATA SETS

Comparison of Data Mining Techniques for Money Laundering Detection System

Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

Significant Interval and Frequent Pattern Discovery in Web Log Data

Mining Sequence Data. JERZY STEFANOWSKI Inst. Informatyki PP Wersja dla TPD 2009 Zaawansowana eksploracja danych

On Mining Group Patterns of Mobile Users

A Comparative study of Techniques in Data Mining

WEB SITE OPTIMIZATION THROUGH MINING USER NAVIGATIONAL PATTERNS

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

Improving Apriori Algorithm to get better performance with Cloud Computing

An application for clickstream analysis

A Fraud Detection Approach in Telecommunication using Cluster GA

Integration of Data Mining Systems using Sequence Process

Building A Smart Academic Advising System Using Association Rule Mining

Data Mining Approach in Security Information and Event Management

A Knowledge Mining Framework for Business Analysts

New Matrix Approach to Improve Apriori Algorithm

VILNIUS UNIVERSITY FREQUENT PATTERN ANALYSIS FOR DECISION MAKING IN BIG DATA

Computer Science in Education

Analysis of Large Web Sequences using AprioriAll_Set Algorithm

Association Rule Mining: A Survey

Data Mining Algorithms And Medical Sciences

Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm

Association Rules Mining: A Recent Overview

Improved Data mining approach to find Frequent Itemset Using Support count table

Performance Evaluation of some Online Association Rule Mining Algorithms for sorted and unsorted Data sets

Word Taxonomy for On-line Visual Asset Management and Mining

A Time Efficient Algorithm for Web Log Analysis

Knowledge Mining for the Business Analyst

Mining Multi Level Association Rules Using Fuzzy Logic

Relational Databases and Data Warehouses æ. Jiawei Han Jenny Y. Chiang Sonny Chee Jianping Chen Qing Chen

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

Top Top 10 Algorithms in Data Mining

Implementing Improved Algorithm Over APRIORI Data Mining Association Rule Algorithm

An Efficient GA-Based Algorithm for Mining Negative Sequential Patterns

EVENT CENTRIC MODELING APPROACH IN CO- LOCATION PATTERN ANALYSIS FROM SPATIAL DATA

Future Trend Prediction of Indian IT Stock Market using Association Rule Mining of Transaction data

Distributed Apriori in Hadoop MapReduce Framework

Top 10 Algorithms in Data Mining

Clustering Data Streams

Using Data Mining Methods to Predict Personally Identifiable Information in s

A Clustering Model for Mining Evolving Web User Patterns in Data Stream Environment

International Journal of Engineering Research ISSN: & Management Technology November-2015 Volume 2, Issue-6

DWMiner : A tool for mining frequent item sets efficiently in data warehouses

An Efficient Frequent Item Mining using Various Hybrid Data Mining Techniques in Super Market Dataset

An Overview of Knowledge Discovery Database and Data mining Techniques

Association Rule Mining

A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING

Scoring the Data Using Association Rules

HYBRID INTRUSION DETECTION FOR CLUSTER BASED WIRELESS SENSOR NETWORK

Statistical Learning Theory Meets Big Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Application of Data Mining Techniques in Intrusion Detection

Discovering Interesting Association Rules in the Web Log Usage Data

WebAdaptor: Designing Adaptive Web Sites Using Data Mining Techniques

Information Management course

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Searching frequent itemsets by clustering data

AN IMPROVED PRIVACY PRESERVING ALGORITHM USING ASSOCIATION RULE MINING(27-32) AN IMPROVED PRIVACY PRESERVING ALGORITHM USING ASSOCIATION RULE MINING

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

An Empirical Study of Application of Data Mining Techniques in Library System

An Improved Algorithm for Fuzzy Data Mining for Intrusion Detection

Association Technique on Prediction of Chronic Diseases Using Apriori Algorithm

Cloud Computing Environments Parallel Data Mining Policy Research

RUBA: Real-time Unstructured Big Data Analysis Framework

A Software System for Spatial Data Analysis and Modeling *

Scientific Report. BIDYUT KUMAR / PATRA INDIAN VTT Technical Research Centre of Finland, Finland. Raimo / Launonen. First name / Family name

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

Philosophies and Advances in Scaling Mining Algorithms to Large Databases

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

WEBLOG RECOMMENDATION USING ASSOCIATION RULES

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

SEQUENCE MINING ON WEB ACCESS LOGS: A CASE STUDY. Rua de Ceuta 118, 6 andar, Porto, Portugal csoares@liacc.up.pt

Transcription:

A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains Dr. Kanak Saxena Professor & Head, Computer Application SATI, Vidisha, kanak.saxena@gmail.com D.S. Rajpoot Registrar, UIT RGPV, Bhopal dsrphd@yahoo.com Abstract: This has much in common with traditional work in statistics and machine learning. However, there are important new issues which arise because of the sheer size of the data. One of the important problem in data mining is the Classificationrule learning which involves finding rules that partition given data into predefined classes. In the data mining domain where millions of records and a large number of attributes are involved, the execution time of existing algorithms can become prohibitive, particularly in interactive applications. 1. Introduction : An enormous amount of data stored in databases and data warehouses, it is increasingly important to develop [1] powerful tools for analysis of such data and mining interesting knowledge from it. Data mining [4] is a process of inferring knowledge from such huge data. It has five major components: - Association rules - Classification or clustering - Characterization & Comparison - Sequential Pattern Analysis. - Trend Analysis An association rule [5] is a rule which implies certain association relationships among a set of objects in a database. In this process we discover a set of association rules at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms [2] often occurring together with certain kinds of diseases and further study the reasons behind them. Since finding interesting association rules in databases may disclose some useful patterns for decision support, selective marketing, financial forecast, medical diagnosis and many other applications, it has attracted [3] a lot of attention in recent data mining research. Mining association rules may require iterative scanning of large transaction or relational databases which is quite costly in processing. 2. A brief review of the work already done in the field : Sequential pattern mining is an interesting data mining problem with many real-world applications. This problem has been studied extensively in static databases. However, in recent years, emerging applications have introduced a new form of data called data stream. In a data stream [6], new elements are generated continuously. This poses additional constraints on the methods used for mining such data: memory usage is restricted, the infinitely [8] flowing original dataset cannot be scanned multiple times, and current results should be available on demand. Mendes, L.F. Bolin Ding, Jiawei Han [9] introduces two effective methods for mining sequential patterns from data streams: the SS-BE method and the SS-MB method. The proposed methods break the stream into 186 http://sites.google.com/site/ijcsis/

batches and only process each batch once. The two methods use different pruning strategies [10] that restrict the memory usage but can still guarantee that all true sequential patterns are output at the end of any batch. Both algorithms scale linearly in execution time as the number of sequences grows, making them effective methods for sequential pattern mining in data streams. The experimental results also show that our methods are very accurate in that only a small fraction of the patterns that are output are false positives. Even for these false positives, SS-BE guarantees that their true support is above a pre-defined threshold. Previous studies have shown mining closed patterns provides more benefits than mining the complete set of frequent patterns, since closed pattern mining leads to more compact results and more efficient algorithms. It is quite useful in a data stream environment where memory and computation power are major concerns. Lei Chang Tengjiao Wang Dongqing Yang Hua Luan [20] studies the problem of mining closed sequential patterns over data stream sliding windows. An efficient algorithm SeqStream is developed to mine closed sequential patterns in stream windows incrementally, and various novel strategies are adopted in SeqStream [7] to prune search space aggressively. Extensive experiments on both real and synthetic data sets show that SeqStream outperforms PrefixSpan, CloSpan and BIDE by a factor of about one to two orders of magnitude. The input data is a set of sequences, called datasequences. Each data sequence is ordered list of transactions (or itemsets), where each transaction is a sets of items (literals). Typically there is a transactiontime associated with each transaction. A sequential pattern also consists of a list of sets of items. The problem is to find all sequential patterns with a userspecified minimum support, where the support of a sequential pattern is the percentage of data sequences that contain the pattern. The framework of sequential pattern discovery is explained here using the example of a customer transaction database as by Agrawal & Srikant [11]. The database is a list of time-stamped transactions for each customer that visits a supermarket and the objective is to discover (temporal) buying patterns that sufficiently many customers exhibit. This is essentially an extension (by incorporation of temporal ordering information into the patterns being discovered) of the original association rule mining framework proposed for a database of unordered transaction records (Agrawal et al 1993) [12] which is known as the Apriori algorithm. Since there are many temporal pattern discovery algorithms that are modeled along the same lines as the Apriori algorithm, it is useful to first understand how Apriori works before discussing extensions to the case of temporal patterns. Let D be a database of customer transactions at a supermarket. A transaction is simply an unordered collection of items purchased by a customer in one visit to the supermarket. The Apriori algorithm [13] systematically unearths all patterns in the form of (unordered) sets of items that appear in a sizable number of transactions. We introduce some notation to precisely define this framework. A non-empty set of items is called an itemset. An itemset i is denoted by (i 1,i 2,i 3, i m ), where i j is an item. Since i has m items, it is sometimes called an m-itemset. Trivially, each transaction in the database is an itemset. However, given an arbitrary itemset i, it may or may not be contained in a given transaction T. The fraction of all transactions in the database in which an itemset is contained in is called the support of that itemset. An 187 http://sites.google.com/site/ijcsis/

itemset whose support exceeds a user-defined threshold is referred to as a frequent itemset. These itemsets [14] are the patterns of interest in this problem. The brute force method of determining supports for all possible itemsets (of size m for various m) is a combinatorially explosive exercise and is not feasible in large databases (which is typically the case in data mining). The problem therefore is to find an efficient algorithm to discover all frequent itemsets in the database D given a user-defined minimum support threshold. The Apriori algorithm exploits the following very simple (but amazingly useful) principle: if i and j are itemsets such that j is a subset of i then the support of j is greater than or equal to the support of i. Thus, for an itemset to be frequent all its subsets must in turn be frequent as well. This gives rise to an efficient levelwise construction of frequent itemsets in D. The algorithm makes multiple passes over the data. Starting with itemsets of size 1 (i.e. 1-itemsets), every pass discovers frequent itemsets of the next bigger size. The first pass over the data discovers all the frequent 1- itemsets. These are then combined to generate candidate 2-itemsets and by determining their supports (using a second pass over the data) the frequent 2- itemsets are found. Similarly, these frequent 2-itemsets are used to first obtain candidate 3-itemsets and then (using a third database pass) the frequent 3-itemsets are found, and so on. The candidate generation before the m th pass uses the Apriori principle described above as follows: an m-itemset is considered a candidate only if all (m 1)-itemsets contained in it have already been declared frequent in the previous step. As m increases, while the number of all possible m-itemsets grows exponentially, the number of frequent m-itemsets grows much slower, and as a matter of fact, starts decreasing after some m. Thus the candidate generation method in Apriori makes the algorithm efficient. This process of progressively building itemsets of the next bigger size is continued till a stage is reached when (for some size of itemsets) there are no frequent itemsets left to continue. This marks the end of the frequent itemset discovery process. 3. Note Worthy Contribution in the field of proposed work : Mendes, L.F. Bolin Ding, Jiawei Han [21] introduces two effective methods for mining sequential patterns from data streams: the SS-BE method and the SS-MB method. The proposed methods break the stream into batches and only process each batch once. The two methods use different pruning strategies that restrict the memory usage but can still guarantee that all true sequential patterns are output at the end of any batch. Both algorithms scale linearly in execution time as the number of sequences grows, making them effective methods for sequential pattern mining in data streams. The experimental results also show that our methods are very accurate in that only a small fraction of the patterns that are output are false positives. Even for these false positives, SS-BE guarantees that their true support is above a pre-defined threshold. Lei Chang Tengjiao Wang Dongqing Yang Hua Luan [22] studies the problem of mining closed sequential patterns over data stream sliding windows. An efficient algorithm SeqStream is developed to mine closed sequential patterns in stream windows incrementally, and various novel strategies are adopted in SeqStream to prune search space aggressively. Extensive experiments on both real and synthetic data sets show that SeqStream outperforms PrefixSpan, CloSpan and BIDE by a factor of about one to two orders of magnitude. 188 http://sites.google.com/site/ijcsis/

The input data is a set of sequences, called datasequences. Each data sequence is a ordered list of transactions (or itemsets), where each transaction is a sets of items (literals). Typically there is a transactiontime associated with each transaction. A sequential pattern also consists of a list of sets of items. The problem is to find all sequential patterns with a userspecified minimum support, where the support of a sequential pattern is the percentage of data sequences that contain the pattern. year R_pst 2003 66.55 2004 68.69 2005 79.72 2006 72.66 2007 68.08 4. Proposed Methodology: We have done study about pattern of different Result Analysis of Our University Result Data of Different Semesters as shown below. 1.2 Result Graph for BE-102 1.3 Year wise Data for Subject Code BE-103 Year R_percent 2003 88.62 2004 90.54 2005 91.57 2006 90.28 2007 90.94 1.1 Result Graph for BE-101 Year wise Data for Subject Code BE-101 Year Rst _Per 2003 62.5 2004 79.8 2005 71.3 2006 78.4 2007 60.4 1.2 Year wise Data for Subject Code BE-101 1.3 Result Graph for BE-103 189 http://sites.google.com/site/ijcsis/

1.4 Year wise Data for Subject Code BE-104 year R_pst 2003 88.62 2004 90.54 2005 91.57 2006 90.28 2007 90.94 1.4 Result Graph for BE-104 1.5 Year wise Data for Subject Code BE-105 Year R_percent 2003 72.8 2004 87.44 2005 69.45 2006 74.4 2007 29.69 The methodology applied to complete the research work of A Way to Undwerstand Various Pattern Mining Techniques for Selected Domain was divided into series of steps. We envisage to study and implement the following methods. - Study of Temporal relations. - Provide an overview, the research survey and summarizing previous work that investigated the various functions of data sequences in various domains. - Problem formulation and generate the frequent sequences. - Sub-division of the sequences based on structure of the sequence i.e. constraints based mining and extended sequence based mining with pruning strategies. - Analysis and evaluation of the proposed sequential pattern mining algorithm with item gap and time stamp. 5. Expected outcome of the proposed work : - Appropriate duration modeling for events in sequences. - Improving time and space complexities of algorithms. - Comparison with the existing models on extract sequence quality, number of extracted sequences and execution time. - Implementation of the proposed sequential patterns. - If implementation is successful then tested for evaluation. 1.5 Result Graph for BE-105 6. Bibliography in standard format : 190 http://sites.google.com/site/ijcsis/

[1] R. Agrawal, R. Srikant, ``Mining Sequential Patterns'', Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995. [2] Ayres J, Gehrke J, Yu T and Flannick J: "Sequential Pattern Mining using a Bitmap Representation" in Int'l Conf Knowledge Discovery and Data Mining, (2002) 429-435 [3] Garofalakis M, Rastogi R and Shim k, Mining Sequential Patterns with Regular Expression Constraints, in IEEE Transactions on Knowledge and Data Engineering,(2002), vol. 14, nr. 3, pp. 530-552 [4] Pei J, Han J. et al: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth in Int'l Conf Data Engineering, (2001) 215-226 [5] Pei J and Han J: "Constrained frequent pattern mining: a pattern-growth view" in SIGKDD Explorations, (2002) vol. 4, nr. 1, pp. 31-39 [6] Antunes C and Oliveira A.L: "Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints" in Int'l Conf Machine Learning and Data Mining, (2003) 239-251 [7] R. Srikant, R. Agrawal: ``Mining Sequential Patterns: Generalizations and Performance Improvements'', Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996. [8] Zaki M, "Efficient Enumeration of Frequent Sequences", in ACM Conf. on InformationKnowledge Management, (1998) 68-75 [9] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: "The Quest Data Mining System", Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August, 1996. [10] Eui-Hong (Sam) Han, Anurag Srivastava and Vipin Kumar: "Parallel Formulations of Inductive Classification Learning Algorithm" (1996). [11] Agrawal, R. Srikant: ``Fast Algorithms for Mining Association Rules'', Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994. [12] J. Han, J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, K. Koperski, G. Liu, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaiane, S. Zhang, H. Zhu, `DBMiner: A System for Data Mining in Relational Databases and Data Warehouses'', Proc. CASCON'97: Meeting of Minds, Toronto, Canada, November 1997. [13] Cheung, J. Han, V. T. Ng, A. W. Fu an Y. Fu, `` A Fast Distributed Algorithm for Mining Association Rules'', Proc. of 1996 Int'l Conf. on Parallel and Distributed Information Systems (PDIS'96), Miami Beach, Florida, USA, Dec. 1996. [14] Ron Kohavi, Dan Sommerfield, James Dougherty, "Data Mining using MLC++ : A Machine Learning Library in C++", Tools with AI, 1996 191 http://sites.google.com/site/ijcsis/