Improved Data mining approach to find Frequent Itemset Using Support count table

Ramratan Ahirwal 1, Neelesh Kumar Kori 2 and Dr. Y. K. Jain 3

1 Samrat Ashok Technological Institute, Vidisha (M.P.), India
2 Samrat Ashok Technological Institute, Vidisha (M.P.), India
3 Samrat Ashok Technological Institute, Vidisha (M.P.), India

Abstract: Mining frequent itemsets has been widely studied over the last decade. Past research focused on mining frequent itemsets from static databases. In many new applications, mining time series and data streams is now an important task. Over the last decade there have been mainly two kinds of frequent pattern mining algorithms. One is Apriori, based on generate-and-test; the other is FP-growth, based on divide-and-conquer, which has been widely used in static data mining. With the new requirements of data mining, however, frequent pattern mining is no longer restricted to that scenario. In this paper we focus on a new mining algorithm that finds frequent patterns in a single scan of the database and requires no candidate generation. To achieve this goal, our algorithm employs one table that retains the support count of each itemset. For a static database the table is virtual, meaning that it is generated whenever frequent itemsets are required; it may also be useful for time series databases. Our algorithm is therefore suitable for static as well as dynamic data mining. Results show that the algorithm is useful in today's data mining environment.

Keywords: Apriori, Association Rule, Frequent Pattern, Data Mining

1 INTRODUCTION

Mining data streams is a very important research topic and has recently attracted a lot of attention, because in many cases data is generated by external sources so rapidly that it may become impossible to store it and analyze it offline. Moreover, in some cases streams of data must be analyzed in real time to provide information about trends, outlier values or regularities that must be signaled as soon as possible. The need for online
computation is a notable challenge with respect to classical data mining algorithms [1], [2]. Important application fields for stream mining are as diverse as financial applications, network monitoring, security problems, telecommunication networks, Web applications, sensor networks, analysis of atmospheric data, etc.

Innovations in computer science have made it possible to acquire and store enormous amounts of data digitally in databases, currently gigabytes or terabytes in a single database and even more in the future. Many fields and systems of human activity have become increasingly dependent on collected, stored, and processed information. However, the abundance of the collected data makes it laborious to find the information essential for a specific purpose.

Data mining is the analysis of (often large) observational datasets, drawn from databases, data warehouses or other large repositories of practical applications and often incomplete, noisy and ambiguous, to find unsuspected relationships and to summarize the data in ways that are both understandable and useful to the data owner. It involves data extraction, cleaning and transformation, analysis, and other processing steps, and it automatically discovers the patterns and interesting knowledge hidden in large amounts of data, which helps us make decisions based on a wealth of data. In the information communication mode of software development, the task is to collect, analyze, and mine the useful information hidden in the data produced by communication between developers and by staff interaction with managers, and then to use that knowledge to make decisions.

Colleges currently use database technology to manage their libraries. Its main purpose is to facilitate the procurement of books, cataloging, and circulation management. To better satisfy readers, we must explore their needs and proactively provide the information they require. Most current library evaluation techniques focus
on frequencies and aggregate measures; these statistics hide underlying patterns. Discovering these patterns is the key to understanding how library services are used [3]. Data mining has been applied to library operations [4].

Volume 1, Issue 2 July-August 2012 Page 195

With the fast development of technology and the growing requirements of users, the dynamic elements in data mining are becoming more important, including dynamic databases and knowledge bases, users' interestingness, and data varying with time and space. In order to solve problems such as low effectiveness, high randomness and hard implementation in dynamic mining, more research on dynamic data mining has been done. In [5], [6], an evolutionary immune mechanism was proposed based on the fact that the elements involved in the domains could be modeled as those in immune models. It focused on how to utilize the relationship between antigens and antibodies in dynamic data mining such as

incremental mining. However, the sole immune mechanism and its related algorithm run effectively only in incremental situations rather than in others. Their performance and function have to be improved when used in more complex and dynamic environments like the Web.

We provide here an overview of executing data mining services and association rules. The rest of this paper is arranged as follows: Section 2 introduces data mining and KDD; Section 3 presents the literature review; Section 4 describes the proposed work; Section 5 gives the result analysis of the algorithm and proposed work; Section 6 presents the conclusion and outlook.

2 DATA MINING AND KDD

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information, that is, information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Several algorithms have been devised for this [5]; the process is shown in Figure 1.

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other
forms that might be more compact (for example, a short report), more abstract (for example, an approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is completely impractical in many domains. Databases are increasing in size in two ways: (1) the number $N$ of records or objects in the database and (2) the number $d$ of fields or attributes per object.

Figure 1: Data Mining Algorithm

Databases containing on the order of $N = 10^9$ objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields $d$ can easily be on the order of $10^2$ or even $10^3$, for example, in medical diagnostic applications.
Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially. The need to scale up human analysis capabilities to handle the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us

unearth meaningful patterns and structure from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era has made a fact of life for all of us: data overload.

3 LITERATURE REVIEW

In 2011, Jinwei Wang et al. [12] proposed, to overcome the shortcomings and deficiencies of existing interpolation techniques for missing data, an interpolation technique for missing context data based on Time-Space Relationship and Association Rule Mining (TSRARM). It performs spatial and time series analysis on sensor data and generates strong association rules to interpolate missing data. A simulation experiment verifies the rationality and efficiency of TSRARM using acquired temperature sensor data.

In 2011, M. Chaudhary et al. [13] proposed a new and more optimized algorithm for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm. The proposed algorithm also generates all the essential rules, and no rule is missing. The use of non-redundant association rules helps significantly in reducing irrelevant noise in the data mining process. This graph-theoretic approach, called the adjacency lattice, is crucial for online mining of data. The adjacency lattice can be stored either in main memory or in secondary memory. The idea of the adjacency lattice is to pre-store a number of large itemsets in a special format, which reduces the disk I/O required to perform a query.

In 2011, Fu et al. [14] observed that real-time monitoring data mining has become a necessary means of improving operational efficiency, economic safety and fault detection in power plants. Based on the data mining algorithm of interactive association rules, and taking full advantage of the association characteristics of real-time test-point data during steam turbine operation, the principle of mining quantitative association rules in parameters is
put forward for the real-time monitoring data of steam turbines. An analysis of the practical operating results of a steam turbine with the data mining method based on the interactive rule shows that it can supervise steam turbine operation and condition monitoring, and provide model reference and decision-making support for fault diagnosis and condition-based maintenance.

In 2011, Xin et al. [15] used association rule learning to process statistical data of the private economy and analyzed the results to improve the quality of that data. The article also provides some exploratory comments and suggestions about the application of association rule mining in private economy statistics.

4 PROPOSED WORK AND ALGORITHM

Frequent itemset mining was introduced in [2] by Agrawal and Srikant. To facilitate our discussion, we give the formal definitions as follows. Let $I = \{i_1, i_2, i_3, \ldots, i_m\}$ be a set of items. An itemset $X$ is a subset of $I$. $X$ is called a $k$-itemset if $|X| = k$, where $k$ is the size (or length) of the itemset. A transaction $T$ is a pair $(tid, X)$, where $tid$ is a unique identifier of the transaction and $X$ is an itemset. A transaction $(tid, X)$ is said to contain an itemset $Y$ iff $Y \subseteq X$. A dataset $D$ is a set of transactions. Given a dataset $D$, the support of an itemset $X$, denoted $Supp(X)$, is the fraction of transactions in $D$ that contain $X$. An itemset $X$ is frequent if $Supp(X)$ is no less than a given threshold $S_0$. An important property of the frequent itemsets, called the Apriori property, is that every nonempty subset of a frequent itemset must also be frequent.

The problem of finding frequent itemsets can be specified as follows: given a dataset $D$ and a support threshold $S_0$, find every itemset whose support in $D$ is no less than $S_0$. It is clear that the Apriori algorithm needs at most $l + 1$ scans of database $D$ if the maximum size of a frequent itemset is $l$. In the context of data streams, to avoid disk access, previous studies focus on finding the
approximation of frequent itemsets with a bound on space complexity. When mining frequent itemsets in static databases, all the frequent itemsets and their support counts derived from the original database are retained. When transactions are added or expire, the support counts of the frequent itemsets contained in them are recomputed. By reusing the retained frequent itemsets and their counts, the number of candidate itemsets generated during the mining process can be reduced. However, a later rescan of the original database is still required, because non-frequent itemsets can become frequent after the database is updated. Therefore these methods cannot work without seeing the entire database and cannot be applied to data streams.

In our approach we introduce a new method that requires only a single scan of database $D$ to count the support of each itemset; no candidate generation or pruning is required to find the frequent itemsets. Our algorithm thus reduces disk access time and finds the frequent itemsets directly by using the support count table. This method is applicable to static databases as well as to dynamic databases, provided the table is created at the initial stage.

4.1 Support Count Table

As stated previously, every itemset $X$ of a transaction $T$ is a subset of $I$ ($X \subseteq I$), and a set of such transactions is the database $D$. So in database $D$ every transaction itemset $X$ is one of the $2^{|I|} - 1$ nonempty elements of $2^I$, where $2^I$ is the power set of $I$. The power set of $I$ contains all the subsets of $I$, and every subset except $\emptyset$ may appear as a transaction itemset in the transaction database $D$. Hence our algorithm employs one table, named the support count table. That table

is assumed to be virtual and is created when required to find frequent itemsets. The size of the table is $(2^{|I|} - 1) \times 2$: it has $2^{|I|} - 1$ rows and two fields, itemset and support count. In this table we enter the frequency count of each itemset observed in the transaction database. The frequency count of an itemset is the number of times that exact itemset occurs as a transaction in the transactional database $D$. The table is generated and may be kept in cache memory until the frequent itemsets have been found. The generated table may be used for stationary databases as well as for time series databases. Its layout is given in Table 1.

4.2 Entries in the Support Count Table

The support count table may be used to find frequent itemsets from static datasets as well as from streaming datasets, where a windowing concept is used. For a static database the table may be created when we want to analyze the database: a single scan of the database makes an entry in the table for every transaction. Initially all support count entries in the table are set to zero. If the database $D$ is static (fixed), we update the table in a single scan of $D$: for each transaction itemset $X$ in $D$, find the corresponding itemset in the table and increment its count. In this way we make an entry for each transaction $T$. The table may then be retained in memory until the observation is complete, so that added or expired transactions only require an update of the table.

If the database $D$ is a random or streaming database, the table may be even more useful, because every incoming or expired transaction only requires incrementing or decrementing the count of the corresponding itemset. The table may be stored in an efficient way so that we can use it to find frequent itemsets or association rules. In this approach we do not need to save the database on disk; it is only necessary to save the table and
use it whenever necessary to find frequent itemsets.

Table 1: Support count table ST

No.            Itemset (A)      Support count (Scount)
1
...
2^|I| - 1

For example, let $I = \{i_1, i_2, i_3, i_4\}$ be the set of items. The itemsets that may be generated from $I$ are $\{i_1\}, \{i_2\}, \{i_3\}, \ldots, \{i_1, i_2, i_3, i_4\}$. Every transaction itemset $X$ that may occur in database $D$ is a subset of $I$ and equals one of these itemsets. The table is created initially as given below.

Table 2: Initial support count table for I = {i1, i2, i3, i4}

No.  Itemset (A)      Support count (Scount)
1    {i1}             0
2    {i2}             0
3    {i3}             0
4    {i4}             0
5    {i1,i2}          0
6    {i1,i3}          0
7    {i1,i4}          0
8    {i2,i3}          0
9    {i2,i4}          0
10   {i3,i4}          0
11   {i1,i2,i3}       0
12   {i1,i2,i4}       0
13   {i1,i3,i4}       0
14   {i2,i3,i4}       0
15   {i1,i2,i3,i4}    0

4.3 Proposed Method to Find Frequent Itemsets

In our proposed work we give a method that may be useful for static as well as streaming databases to find frequent itemsets. We employ the support count table, which requires scanning the database only once to make an entry in the table for each transaction. The table retains this information until the observation is complete or the frequent itemsets have been found. When transactions are added to the dataset or expire from it, the table is updated at the same time. The updated support count table holds the frequency count of each itemset. To find the frequent itemsets for any threshold value we scan the table, not the database. Apriori requires $l + 1$ scans of the dataset and generates candidates to find the frequent set; our approach needs only a single scan of the database and no candidate generation.

The table holds the frequency count of every itemset, not its total support count. The frequency count of an itemset is the number of occurrences of that itemset in the transactional database $D$, so to find the frequent itemsets we need the total support count of each itemset. The total support count of an itemset is the count of the
occurrences of all the items of that itemset together in the transactions of $D$. In our scheme this total count is calculated by scanning the table: it is the sum of the frequency counts of all table entries that contain the itemset. The total support count is then compared with the threshold $S_0$; if the count is no less than the threshold, the itemset is included in the frequent set. This procedure is repeated for every itemset to find all frequent itemsets.

Algorithm: To find frequent itemsets
Input: A database $D$ and the support threshold $S_0$
Output: Frequent itemsets $F_{itemset}$
Method:

Step 1: Scan the transaction database $D$ and update the support count table $ST$ as given in Sec. 4.2; $F_{itemset} = \{\}$
Step 2: for ($i = 1$; $i < 2^{|I|}$; $i$++)   // for each itemset $A_i$ in $ST$ repeat the steps; $2^{|I|}$ gives the total number of elements in the power set of $I$
            $T_{count} = 0$;   // total count
Step 3:     for ($j = 1$; $j < 2^{|I|}$; $j$++)   // repeat Step 3 to find the total count
Step 3.1:       if $A_i \subseteq A_j$ then $T_{count} = T_{count} + S_{count}(j)$
Step 4:     if ($T_{count} \geq S_0$) then $F_{itemset} = F_{itemset} \cup \{A_i\}$
Step 5: Go to Step 2
Step 6: End

To better explain our algorithm, we now consider an example. Let $I = \{10, 20, 30, 40\}$ be the set of four items, and let the threshold be 2. The total number of transactions in $D$ is 15. The transactions of $D$ are given below:

tid  Transaction
1    {10}
2    {10,20}
3    {30,40}
4    {10,20,30,40}
5    {10,30}
6    {10,30}
7    {30,40}
8    {20,30,40}
9    {20,30,40}
10   {10,20,30}
11   {20,30}
12   {40}
13   {20,30}
14   {10,20,30}
15   {10}

Step 1: Scanning the database gives the support count table shown in Table 3.
Step 2: To find the frequent itemsets we make use of the support count table given below.

Table 3: Frequency count for the above example

No.  Itemset (A)      Support count (Scount)
1    {10}             2
2    {20}             0
3    {30}             0
4    {40}             1
5    {10,20}          1
6    {10,30}          2
7    {10,40}          0
8    {20,30}          2
9    {20,40}          0
10   {30,40}          2
11   {10,20,30}       2
12   {10,20,40}       0
13   {10,30,40}       0
14   {20,30,40}       2
15   {10,20,30,40}    1

To check whether itemset {10} is frequent, we obtain its total support count by scanning the support count table. From the table, the total support of {10} is 8 (2 + 1 + 2 + 2 + 1, from the entries {10}, {10,20}, {10,30}, {10,20,30} and {10,20,30,40}). This total support count is compared with the threshold value 2; since the threshold is less than the total count, {10} is a frequent itemset and is included in $F_{itemset}$. This process is repeated for every itemset. In this way we obtain every frequent itemset using the support count table. The frequent itemsets for the given dataset are:

F_itemset = {{10}, {20}, {30}, {40}, {10,20}, {10,30}, {20,30}, {20,40}, {30,40}, {10,20,30}, {20,30,40}}

5 RESULT ANALYSIS

To study the performance of our
proposed algorithm, we carried out several experiments. The experimental environment is an Intel Core processor running the Windows XP operating system. The algorithm is implemented in Java using NetBeans 7.1. The parameters used have the following meanings: $D$ is the transaction database, $I$ is the number of items in the transactions, and $S_0$ is the minimum support. Table 4 shows the execution time in seconds when $I = 5$, the transactional database $D$ scales up from 50 to 1000 transactions, and the minimum support $S$ scales up from 2 to 8. We see from the table

that, along the rows, the execution time decreases linearly as the minimum support scales up, while the time increases as the database $D$ scales up, though not in a linear way.

Table 4: Execution time (s) when $D$ scales up from 50 to 1000 and $S$ scales up from 2 to 8 (number of transactions versus minimum support $S$)

Figure 4: Comparison of execution time (s) for minimum support ($S_0 = 2$) with the algorithm given in reference [16]

Figure 4 shows the comparison of the execution time of our proposed algorithm with $S_0 = 2$ as the database $D$ scales up from 50 to 175. The comparison shows that our approach gives somewhat better performance than the method proposed in reference [16].

Figure 2: Execution time (s), minimum support ($S_0 = 2$)

Figure 2 shows that the execution time of the algorithm (for minimum support $S_0 = 2$, $I = 5$) increases almost linearly with increasing dataset size. It can be concluded that our algorithm has good scalability. To examine the scalability of our algorithm further, we increased the dataset $D$ from 1000 to 6000 with the same parameters, minimum support $S_0 = 2$ and $I = 5$; the result is given in Figure 5.

Figure 3: Execution time (s), transaction database ($D = 200$)

Figure 5: Scale-up: number of transactions

6 CONCLUSION AND OUTLOOK

Data mining is the exploration of knowledge from large sets of data generated as a result of various data processing activities. Frequent pattern mining is a very important task in data mining. Previous approaches to generating frequent sets generally adopt candidate generation and pruning techniques to reach the desired objective. In this paper we present an algorithm that is useful for data mining and knowledge discovery without candidate generation; our approach reduces disk access time and finds the frequent itemsets directly by using the support count table. The proposed method works well with static datasets by using the support count table, as well as for stream mining, which requires fast, real-time
processing in order to keep up with the high data arrival rate, and where mining results are expected to be available within a short response time. We also validate the algorithm on static datasets through the results shown in the graphs.
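
As a compact recap of the method of Section 4, the following Python sketch builds the support count table in one pass, derives total support counts from the table alone, and updates the table when transactions arrive or expire. It is only an illustration under stated assumptions, not the Java implementation used in the experiments: the function names are hypothetical, the threshold is treated as an absolute count compared with "no less than", and a `Counter` holding only the observed itemsets stands in for the full table of $2^{|I|} - 1$ rows.

```python
from collections import Counter
from itertools import combinations

# The 15 example transactions from Section 4.3.
TRANSACTIONS = [
    {10}, {10, 20}, {30, 40}, {10, 20, 30, 40}, {10, 30},
    {10, 30}, {30, 40}, {20, 30, 40}, {20, 30, 40}, {10, 20, 30},
    {20, 30}, {40}, {20, 30}, {10, 20, 30}, {10},
]


def build_support_count_table(transactions):
    """Single scan: count how often each exact itemset occurs as a transaction."""
    table = Counter()  # assumption: sparse table, unseen itemsets count as 0
    for itemset in transactions:
        table[frozenset(itemset)] += 1
    return table


def total_support(table, itemset):
    """Sum the counts of every table entry that contains `itemset`."""
    target = frozenset(itemset)
    return sum(count for entry, count in table.items() if target <= entry)


def frequent_itemsets(table, items, threshold):
    """Return every nonempty subset of `items` with total support >= threshold."""
    result = []
    ordered = sorted(items)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            if total_support(table, combo) >= threshold:
                result.append(frozenset(combo))
    return result


def add_transaction(table, itemset):
    """A newly arrived transaction increments the count of its itemset."""
    table[frozenset(itemset)] += 1


def expire_transaction(table, itemset):
    """An expired transaction decrements the count of its itemset."""
    key = frozenset(itemset)
    if table[key] == 0:
        raise ValueError("itemset was never recorded in the table")
    table[key] -= 1


if __name__ == "__main__":
    table = build_support_count_table(TRANSACTIONS)
    print(total_support(table, {10}))  # 8, as in the worked example
    found = frequent_itemsets(table, {10, 20, 30, 40}, threshold=2)
    print(len(found))  # 11 itemsets, the same set as listed in Section 4.3
```

Because every query scans the table, enumerating all itemsets compares up to $(2^{|I|} - 1)^2$ pairs of entries in the worst case, so a sketch like this is practical only for a small number of items, such as the $I = 5$ setting used in the experiments.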

In this paper we improve performance by avoiding candidate generation. The experiments indicate that the algorithm is faster and somewhat more efficient than the existing itemset mining algorithm it was compared with.

REFERENCES

[1] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, Mining data streams: A review, ACM SIGMOD Record, vol. 34, no. 1, 2005.
[2] C. C. Aggarwal, Data Streams: Models and Algorithms, Springer, 2007.
[3] S. Nicholson, The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making, Information Technology and Libraries, 22(4), 2003.
[4] Jiann-Cherng Shieh and Yung-Shun Lin, Bibliomining User Behaviors in the Library, Journal of Educational Media & Library Sciences, 44(1):36-60, 2006.
[5] Yiqing Qin, Bingru Yang, Guangmei Xu, et al., Research on Evolutionary Immune Mechanism in KDD, in Proceedings of Intelligent Systems and Knowledge Engineering 2007 (ISKE2007), Chengdu, China, October 2007.
[6] B. R. Yang, Knowledge discovery based on inner mechanism: construction, realization and application, USA: Elliott & Fitzpatrick Inc., 2004.
[7] Binesh Nair and Amiya Kumar Tripathy, Accelerating Closed Frequent Itemset Mining by Elimination of Null Transactions, Journal of Emerging Trends in Computing and Information Sciences, vol. 2, no. 7, July 2011.
[8] E. Ramaraj and N. Venkatesan, Bit Stream Mask-Search Algorithm in Frequent Itemset Mining, European Journal of Scientific Research, vol. 27, no. 2, 2009.
[9] Shilpa and Sunita Parashar, Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data, International Journal of Computer Applications, vol. 31, no. 1, October 2011.
[10] G. Cormode and M. Hadjieleftheriou, Finding frequent items in data streams, in Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, 2008.
[11] D. Y. Chiu, Y. H. Wu, and A. L. Chen, Efficient frequent sequence mining by a dynamic strategy switching algorithm, The International Journal on Very Large Data
Bases (VLDB Journal), 18(1), 2009.
[12] Jinwei Wang and Haitao Li, An Interpolation Approach for Missing Context Data Based on the Time-Space Relationship and Association Rule Mining, Multimedia Information Networking and Security (MINES), IEEE, 2011.
[13] M. Chaudhary, A. Rana, and G. Dubey, Online Mining of data to generate association rule mining in large databases, International Conference on Recent Trends in Information Systems (ReTIS), IEEE, Dec. 2011.
[14] Fu Jun, Yuan Wen-hua, Tang Wei-xin, and Peng Yu, Study on Monitoring Data Mining of Steam Turbine Based on Interactive Association Rules, Computer Distributed Control and Intelligent Environmental Monitoring (CDCIEM), IEEE, 2011.
[15] Xin Jinguo and Wei Tingting, The application of association rules mining in data processing of private economy statistics, E-Business and E-Government (ICEE), IEEE, 2011.
[16] Weimin Ouyang and Qinhua Huang, Discovery Algorithm for mining both Direct and Indirect weighted Association Rules, International Conference on Artificial Intelligence and Computational Intelligence, IEEE, 2009.

AUTHORS

Mr. Ram Ratan Ahirwal received his B.E. (First) degree in Computer Science & Engineering from GEC Bhopal, University RGPV Bhopal, in 2002. In August 2003 he joined Samrat Ashok Technological Institute, Vidisha (M.P.), as a lecturer in the Computer Science & Engineering Department, and he completed his M.Tech. degree (with honours) as a sponsored candidate in CSE from SATI (Engineering College), Vidisha, University RGPV Bhopal (M.P.), India in 2009. Currently he is working as an assistant professor in the CSE Department, SATI Vidisha. He has more than 12 publications in various refereed international journals and international conferences to his credit. His areas of interest are data mining, image processing, computer networks, network security and natural language processing.

Neelesh Kumar Kori received his B.E. (First) degree in Information Technology from UIT, BU Bhopal (M.P.), India in 2008, and he is currently pursuing an M.Tech. from SATI Vidisha (M.P.),
India in Computer Science & Engineering.

Dr. Y. K. Jain is Head of the CSE Department, SATI (Degree) Engineering College, Vidisha (M.P.), India. He has publications in various refereed international journals and international conferences to his credit. His areas of interest are image processing and computer networks.