DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS

The basic data mining algorithms introduced may be enhanced in a number of ways.

Data mining algorithms have traditionally assumed that data is memory resident, as for example in the Apriori algorithm for association rule mining or the basic clustering algorithms. Modifications to such algorithms are required for them to work efficiently with large persistent data sets.

Also, much data is no longer held in persistent form in data warehouses: it is generated at such a rate that storing it in that way is infeasible. Instead, the data is represented by transient data streams. This necessitates further modified solutions to the data mining techniques we have studied so far.
As scalable architectures based on commodity hardware, such as Hadoop, have emerged, techniques to support data mining on these architectures have been developed. Architectures which support analytics with both real-time processing of stream data and batch-oriented processing of historical data are also required.

Another problem is that once a data mining model has been developed, there has traditionally been no mechanism for that model to be reused programmatically by other applications on other data sets. Hence, standards for data mining model exchange have been developed.

Also, data mining has traditionally been performed by dumping data from the data warehouse to an external file, which is then transformed and mined. This results in a series of files for each data mining application, with the attendant problems of data redundancy, inconsistency and data dependence which database technology was designed to overcome. Hence, techniques and standards for tighter integration of database and data mining technology have been developed.
DATA MINING OF LARGE DATA SETS

Algorithms for classification, clustering, and association rule mining are considered.

CLASSIFYING LARGE DATA SETS: SUPPORT VECTOR MACHINES

To reduce the computational cost of solving the SVM optimization problem with large training sets, chunking may be used. This partitions the training set into chunks, each of which fits into memory, and the support vector parameters are computed iteratively chunk by chunk. However, multiple passes over the data are required to obtain an optimal solution (a chunking sketch follows below).

Another approach is squashing, in which the SVM is trained over clusters derived from the original training set, with the clusters reflecting the distribution of the original training records. A further approach reformulates the optimization problem so that it can be solved by efficient iterative algorithms.
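A minimal sketch of the chunking idea, assuming scikit-learn is available: only the support vectors found in each chunk are carried forward and combined with the next chunk before retraining. A single pass is shown here, whereas the full technique iterates until the optimality conditions hold.

import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=1000):
    model = SVC(kernel="rbf")
    sv_X = np.empty((0, X.shape[1]))   # support vectors retained so far
    sv_y = np.empty(0)
    for start in range(0, len(X), chunk_size):
        # train on the retained support vectors plus the next chunk
        chunk_X = np.vstack([sv_X, X[start:start + chunk_size]])
        chunk_y = np.concatenate([sv_y, y[start:start + chunk_size]])
        model.fit(chunk_X, chunk_y)
        sv_X = model.support_vectors_  # keep only what defines the boundary
        sv_y = chunk_y[model.support_]
    return model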
CLUSTERING LARGE DATA SETS: K-MEANS

Unless there is sufficient main memory to hold the data being clustered, the data scan at each iteration of the k-means algorithm will be very costly. An approach for large databases should:

- Perform at most one scan of the database.
- Work with limited memory.

Approaches include the following:

- Identify three kinds of data objects: those which are discardable because their cluster membership has been established; those which are compressible because, while not discardable, they belong to a well-defined subcluster which can be characterized in a compact structure; and those which are neither discardable nor compressible and so must be retained in main memory.
- Alternatively, first group the data objects into microclusters and then perform k-means clustering on those microclusters.
An approach developed at Microsoft combines these ideas as follows:

1. Read a sample subset of data from the database.
2. Cluster that data with the existing model as usual, producing an updated model.
3. On the basis of the updated model, decide for each data item in the sample whether it should be retained in memory, discarded (with summary information being updated), or retained in a compressed form as summary information.
4. Repeat from 1 until a termination condition is met.
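The single-scan, limited-memory flavour of this loop can be sketched with scikit-learn's MiniBatchKMeans, which likewise updates an existing model from successive samples; the discard/compress bookkeeping of the full algorithm is omitted from this sketch.

from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=4, random_state=0)

def cluster_in_one_scan(read_samples):
    # read_samples is a hypothetical generator yielding batches of rows
    for batch in read_samples():
        model.partial_fit(batch)   # update the existing model with the sample
    return model.cluster_centers_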
ASSOCIATION MINING OF LARGE DATA SETS: APRIORI

With one database scan for each itemset size tested, the cost of scans would be prohibitive for the Apriori algorithm unless the database is resident in memory. Approaches to enhance the efficiency of Apriori include the following.

While generating 1-itemsets, for each transaction generate its 2-itemsets at the same time, hashing the 2-itemsets into a hash table structure. All buckets whose final count of itemsets is less than the minimum support threshold can be ignored subsequently, since no itemset in such a bucket can itself have the minimum required support (see the sketch below).
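A minimal sketch of this hash-based pruning of candidate 2-itemsets: the bucket count is an upper bound on the support of every pair hashing to that bucket, so low-count buckets eliminate all of their pairs.

from itertools import combinations

def hash_prune_pairs(transactions, min_support, n_buckets=1024):
    item_count = {}
    buckets = [0] * n_buckets
    for t in transactions:
        for item in t:                              # count 1-itemsets
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):     # hash 2-itemsets in the same scan
            buckets[hash(pair) % n_buckets] += 1
    frequent = {i for i, c in item_count.items() if c >= min_support}
    # a candidate pair survives only if both items are frequent and its
    # bucket count reaches the threshold
    return [p for p in combinations(sorted(frequent), 2)
            if buckets[hash(p) % n_buckets] >= min_support]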
The testing of k-itemsets and (k+1)-itemsets can be overlapped by counting (k+1)-itemsets in parallel with counting k-itemsets. Unlike conventional Apriori, in which candidate (k+1)-itemsets are only generated after the k-itemset database scan, in this approach a database scan is divided into blocks, and candidate (k+1)-itemsets can be generated from any completed block while the k-itemset scan continues.

Only two database scans are needed if a partitioning approach is adopted, under which the transactions are divided into n partitions, each of which can be held in memory. In the first scan, frequent itemsets are generated for each partition; these are combined to create a list of candidate frequent itemsets for the database as a whole. (Any itemset frequent in the whole database must be frequent in at least one partition, so no candidate is missed.) In the second scan, the actual support for the members of the candidate list is checked, as sketched below.
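A sketch of the two-scan partitioning approach. Here local_frequent is a hypothetical helper standing in for any in-memory frequent-itemset miner applied to a single partition, returning a set of frozensets.

def partition_apriori(partitions, min_support_ratio, local_frequent):
    # scan 1: the union of locally frequent itemsets gives the global candidates
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, min_support_ratio)
    # scan 2: count the actual global support of each candidate
    counts = dict.fromkeys(candidates, 0)
    total = 0
    for part in partitions:
        for t in part:
            total += 1
            items = set(t)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
    return {c for c, n in counts.items() if n / total >= min_support_ratio}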
Alternatively, pick a random sample of transactions which will fit in memory and search for frequent itemsets in that sample. This may result in some globally frequent itemsets being missed. The chance of this happening can be lessened by adopting a lower minimum support threshold for the sample, with the remainder of the database then being used to check the actual support for the candidate itemsets. A second database scan may be needed to ensure that no frequent itemsets have been missed, as in the sketch below.
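A sketch of the sampling approach; local_frequent is again a hypothetical in-memory miner, and the slack factor that lowers the sample threshold is an illustrative assumption.

import random

def sample_then_verify(transactions, min_support_ratio, local_frequent,
                       sample_size=10_000, slack=0.8):
    sample = random.sample(transactions, min(sample_size, len(transactions)))
    # the lowered threshold reduces the chance of missing a frequent itemset
    candidates = local_frequent(sample, min_support_ratio * slack)
    total = len(transactions)
    counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
    return {c for c, n in counts.items() if n / total >= min_support_ratio}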
An alternative approach to increasing efficiency, FP-growth, holds the frequent itemsets in compressed form in a prefix-tree structure known as an FP-tree. Using this, only two database scans are needed to identify all frequent itemsets:

- The first scan identifies the frequent items.
- The second scan builds the FP-tree, in which each path from the root represents a frequent itemset.

Consider again the set of transactions seen when the Apriori algorithm was introduced, but with two additional transactions T5 and T6:

Trans_id  List_of_items
T1        I1, I3, I4
T2        I2, I3, I5
T3        I1, I2, I3, I5
T4        I2, I5
T5        I3, I5
T6        I1, I4, I5

Assume frequent itemsets with minimum support 50% are required.
The first database scan is the same as in Apriori, generating the 1-itemsets and their support. These are then stored in a list in descending order of support count:

I5  5
I3  4
I2  3
I1  3
I4  2

The second database scan builds the FP-tree by processing the items in each transaction in the order of the list generated in the first scan. A branch is created for each transaction, sharing paths from the root with previously constructed branches containing the same items. Each node, representing an item, also contains a count which is incremented as each transaction including that node is added. Finally, a linked list is created from each item in the first-scan list to the nodes representing that item in the tree.
Hence, after the second scan, for transactions T1 to T6 (minimum support 3, so I4 has insufficient support and is omitted) the following structure results, with the header list I5:5, I3:4, I2:3, I1:3 linking into the tree:

root:6
+-- I3:1
|   +-- I1:1
+-- I5:5
    +-- I3:3
    |   +-- I2:2
    |       +-- I1:1
    +-- I2:1
    +-- I1:1

FP-growth then processes the FP-tree to identify the frequent itemsets. A construction sketch follows.
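A minimal sketch of the two-scan FP-tree construction just described; the header's linked lists are represented here as Python lists of nodes.

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support):
    # scan 1: frequent items, in descending order of support
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    # note: items tied on support (here I2 and I1) may be ordered either way
    order = [i for i, c in sorted(counts.items(), key=lambda kv: -kv[1])
             if c >= min_support]
    root, header = Node(None, None), {i: [] for i in order}
    # scan 2: insert each transaction, sharing prefixes from the root
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

# The T1-T6 transactions above, with minimum support 3 (50% of 6):
tx = [{"I1", "I3", "I4"}, {"I2", "I3", "I5"}, {"I1", "I2", "I3", "I5"},
      {"I2", "I5"}, {"I3", "I5"}, {"I1", "I4", "I5"}]
root, header = build_fp_tree(tx, min_support=3)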
STREAM MINING

Data streams arise in many application areas, including real-time sensors, internet activity and scientific experiments. Such data is characterized by being:

- Infinite.
- Arriving continuously.
- Arriving at a high rate.
- Arriving in a time sequence whose ordering is significant.

It is not possible for such data to be stored in a conventional database or data warehouse: the data volumes and stream flow rates are too great. Hence, new techniques with new data structures and algorithms are needed.
These techniques include the following.

- Sampling: a probabilistic method is used to choose which data items within the stream are processed (see the reservoir sampling sketch after this list).
- Load shedding: a sequence of the data stream is dropped and not processed.
- Sketching: only a subset of the information within the stream is processed.
- Synopsis data structures: summarization techniques are used to hold information about the stream in a more space-efficient form.
- Aggregation: statistical measures are used to summarize the data stream.
- Approximation algorithms: given the computational complexity of exact mining algorithms, algorithms which give approximate solutions with defined error bounds are sought instead.
- Sliding window: analysis is restricted to the most recent data within the stream.
- Algorithm output granularity: algorithms are developed which adapt to the available resources, such as processor load and memory availability.
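As an illustration of the sampling technique, here is a sketch of reservoir sampling, which keeps a fixed-size uniform random sample of an unbounded stream:

import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, n)        # inclusive of both endpoints
            if j < k:
                reservoir[j] = item         # replace with probability k/(n+1)
    return reservoir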
These techniques have been used in the development of classical data mining tasks such as clustering, classification and frequent itemset identification within stream data.

For example, in the case of frequent itemset identification, the FP-tree structure can be used to hold itemset information which is updated incrementally as transaction data arrives. As the first batch of transactions is received, the ordering of items can be determined on the basis of their support, as in the FP-growth algorithm seen above, and an FP-tree created. This ordering is not subsequently changed. As subsequent batches of transactions are received, the FP-tree is updated to reflect the revised support for itemsets.

Since it is infeasible to store itemset information for all batches indefinitely, a tilted time window approach is often used: as new batches of transactions are received, itemset counts are computed for those new batches individually, while itemset information for older batches is aggregated over progressively longer time periods, as in the sketch below.
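A sketch of a tilted time window over per-batch itemset counts; the window structure (groups of four batches merged per level) is an illustrative assumption, not the specific scheme of any one published algorithm.

from collections import Counter

class TiltedWindow:
    def __init__(self, batches_per_level=4, n_levels=3):
        self.k = batches_per_level
        self.levels = [[] for _ in range(n_levels)]  # level 0 holds the newest batches

    def add_batch(self, itemset_counts):
        self.levels[0].append(Counter(itemset_counts))
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.k:
                merged = Counter()               # aggregate the oldest k windows
                for c in self.levels[i][:self.k]:
                    merged.update(c)
                del self.levels[i][:self.k]
                self.levels[i + 1].append(merged)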
DATA MINING ARCHITECTURE TRENDS

The Hadoop technologies already seen supporting data warehousing are increasingly being used to support data mining and analytics. For example, Pig has been used extensively in organisations including Twitter to support data mining and analytics.

With the increasing importance of stream data, a particular architectural challenge is how to support both the real-time processing required for stream data and the batch-oriented processing needed for historical data: the Hadoop technologies seen so far were originally designed with the latter in mind.

One influential approach has been the lambda architecture, which consists of three layers, with all incoming data dispatched to both the batch and speed layers:

- A batch layer manages the append-only master data set and precomputes views on it.
- A serving layer indexes these views to enable ad hoc, low-latency queries.
- A speed layer deals with recent data only, to avoid the high latency of updates to the serving layer.

A sketch of the query-time combination of these layers follows.
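A minimal sketch of how a query combines the layers, assuming simple counts as the view type; the keys and values are illustrative.

# Precomputed by the batch layer and indexed by the serving layer:
batch_view = {"page_a": 10_000, "page_b": 7_500}
# Maintained by the speed layer over data arriving since the last batch run:
speed_view = {"page_a": 42, "page_c": 7}

def query(key):
    # the answer covers all data seen so far: historical plus recent
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))   # 10042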
Hadoop technologies are well suited to supporting the requirements of the batch layer. Column-store or NoSQL technologies are often used to support the storage requirements of the serving layer. Stream technologies are required to support the speed layer, for example Storm.

Storm is a distributed computing framework designed for real-time stream processing. The processing model is similar to MapReduce in that it supports a graph-oriented workflow. However, while the MapReduce model is batch oriented, Storm has been designed to support real-time processing of stream rather than batch data, with streams continuing indefinitely.

Storm was developed at BackType, which was acquired by Twitter. It was developed and deployed for internal use before being made open source.
DATA MINING STANDARDS

Data mining standards, and related standards for data grids, web services and the semantic web, enable the easier deployment of data mining applications across platforms. Standards cover:

- The overall KDD process.
- Metadata interchange with data warehousing applications.
- The representation of data cleaning, data reduction and transformation processes.
- The representation of data mining models.
- APIs for performing data mining processes from other languages, including SQL and Java.
CRISP-DM AND THE PREDICTIVE MODEL MARKUP LANGUAGE (PMML)

CRISP-DM (CRoss Industry Standard Process for Data Mining) specifies a process model covering the following six phases of the KDD process:

- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

PMML is an XML-based standard developed by the Data Mining Group, a consortium of data mining product vendors. PMML represents data mining models as well as operations for cleaning and transforming data prior to modeling. The aim is to enable one application to produce a data mining model in a form (PMML XML) which another data mining application can read and apply.
The original slides showed at this point the PMML representation of an example association rules model, together with its transaction data; the figure is not reproduced in this transcription.
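As an illustration, below is a minimal sketch of what such a document can look like, using elements from the PMML AssociationModel schema with values computed from the T1-T6 transactions above (I3 has support 4/6, I5 support 5/6, {I3, I5} support 3/6, and the rule I3 => I5 confidence 3/4). The document is held as a Python string so that it can be parsed programmatically; a real document also declares the PMML namespace and version, omitted here for brevity.

import xml.etree.ElementTree as ET

PMML_SKETCH = """
<PMML>
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" dataType="string"/>
    <DataField name="item" optype="categorical" dataType="string"/>
  </DataDictionary>
  <AssociationModel functionName="associationRules"
                    numberOfTransactions="6" minimumSupport="0.5"
                    minimumConfidence="0.5">
    <MiningSchema>
      <MiningField name="transaction" usageType="group"/>
      <MiningField name="item" usageType="active"/>
    </MiningSchema>
    <Item id="1" value="I3"/>
    <Item id="2" value="I5"/>
    <Itemset id="1" support="0.67"><ItemRef itemRef="1"/></Itemset>
    <Itemset id="2" support="0.83"><ItemRef itemRef="2"/></Itemset>
    <Itemset id="3" support="0.5">
      <ItemRef itemRef="1"/><ItemRef itemRef="2"/>
    </Itemset>
    <AssociationRule support="0.5" confidence="0.75"
                     antecedent="1" consequent="2"/>
  </AssociationModel>
</PMML>
"""

root = ET.fromstring(PMML_SKETCH)
for rule in root.iter("AssociationRule"):     # list the rules in the model
    print(rule.get("antecedent"), "=>", rule.get("consequent"),
          "confidence", rule.get("confidence"))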
The XML schema specification for the association model, which defines the structure of such documents, was likewise shown in the original slides and is not reproduced here.
The components of a PMML document are as follows (the first two and the last being used in the example model above):

- Data dictionary: defines a model's input attributes, with their types and value ranges.
- Mining schema: defines the attributes and roles specific to a particular model.
- Transformation dictionary: defines the following mappings: normalization (continuous or discrete values to numbers), discretization (continuous to discrete values), value mapping (discrete to discrete values), and aggregation (grouping values, as in SQL).
- Model statistics: statistics about individual attributes.
- Models: includes regression models, cluster models, association rules, neural networks, Bayesian models and sequence models.

PMML is used within the standards CWM, SQL/MM Part 6 Data Mining, JDM and MS Analysis Services (OLE DB for Data Mining), providing a degree of compatibility between them all.
COMMON WAREHOUSE METAMODEL (CWM)

CWM supports the interchange of warehouse and business intelligence metadata between warehouse tools, warehouse platforms and warehouse metadata repositories in distributed heterogeneous environments.

SQL/MM DATA MINING

The SQL Multimedia and Application Packages (SQL/MM) standard, Part 6, specifies an SQL interface to data mining applications and services through SQL:1999 user-defined types, as follows:

- User-defined types for four data mining functions: association rules, clustering, classification and regression.
- Routines to manipulate these user-defined types, allowing:
  - Setting parameters for mining activities.
  - Training of mining models, in which a particular mining technique is chosen, parameters for that technique are set, and the mining model is built with training data sets.
  - Testing of mining models (applicable only to regression and classification models), in which the trained model is evaluated by comparison with results for known data.
  - Application of mining models, in which the model is applied to new data to cluster, predict or classify as appropriate. This phase is not applicable to rule models, in which the rules are determined during the training phase.
- User-defined types for data structures common across these data mining models.
- Functions to capture metadata for data mining input.
For example, for the association rule model type DM_RuleModel the following methods are supported:

DM_impRuleModel(CHARACTER LARGE OBJECT(DM_MaxContentLength))
    Import a rule model expressed as PMML, returning a DM_RuleModel.
DM_expRuleModel()
    Export the rule model as PMML.
DM_getNORules()
    Return the number of rules.
DM_getRuleTask()
    Return the data mining task value, data mining settings, etc.
JAVA DATA MINING (JDM)

Java Data Mining (JDM) is a Java API, developed under the Java Community Process, supporting common data mining operations as well as the metadata supporting mining activities.

JDM 1.0 supports the following mining functions: classification, regression, attribute importance (ranking), clustering and association rules. It supports the following tasks: model building, testing, application, and model import/export. JDM does not support tasks such as data transformation, visualization and the mining of unstructured data.

JDM has been designed so that its metadata maps closely to PMML, to support the generation of XML for mining models. Likewise, its metadata maps closely to CWM, to support the generation of XML for mining tasks. The JDM API maps closely to SQL/MM Data Mining, to support an implementation of JDM on top of SQL/MM.
OLE DB FOR DATA MINING AND DMX: SQL SERVER ANALYSIS SERVICES

OLE DB for Data Mining, developed by Microsoft and incorporated in SQL Server Analysis Services, specifies a structure for holding the information defining a mining model, and a language for creating and working with these mining models. The approach has been to adopt an SQL-like framework for creating, training and using a mining model: a mining model is treated as though it were a special kind of table. The DMX language, which is SQL-like, is used to create and work with models. For example:
CREATE MINING MODEL [AGE PREDICTION] (
    [CUSTOMER ID]    LONG   KEY,
    [GENDER]         TEXT   DISCRETE,
    [AGE]            DOUBLE DISCRETIZED() PREDICT,
    [ITEM PURCHASES] TABLE (
        [ITEM NAME]     TEXT   KEY,
        [ITEM QUANTITY] DOUBLE NORMAL CONTINUOUS,
        [ITEM TYPE]     TEXT   RELATED TO [ITEM NAME]
    )
) USING [MS DECISION TREE]

The column to be predicted, AGE, is identified by the PREDICT keyword, together with DISCRETIZED(), indicating that a discretization into ranges of values is to take place. ITEM QUANTITY is identified as having a normal distribution, which may be exploited by some mining algorithms.
ITEM TYPE is identified as being related to ITEM NAME. This reflects a one-to-many constraint: each item has exactly one type. It can be seen from the column specification that a nested table representation is used, with ITEM PURCHASES itself being a table nested within AGE PREDICTION. A conventional representation would result either in duplicate data in a single non-normalized table or in data spread across multiple normalized tables.

The USING clause specifies the algorithm that will be used to construct the model. Having created a model, it may be populated with a caseset of training data using an INSERT statement. Predictions are then obtained by executing a prediction join to match the trained model with the caseset to be mined. This process can be thought of as matching each case in the data to be mined against every possible case in the trained model, to find a predicted value for each case which matches a case in the model.
SQL Server Analysis Services supports data mining algorithms for use with:

- conventional relational tables
- OLAP cube data

Mining techniques supported include:

- classification (decision trees)
- clustering (k-means)
- association rule mining

The Predictive Model Markup Language (PMML) is supported, and Microsoft provides data mining tutorials for SQL Server Analysis Services.
DATA MINING PRODUCTS: OPEN SOURCE

A number of open-source packages and tools provide data mining capabilities, including R, Weka, RapidMiner and Mahout.

R is both a language for statistical computing and the visualisation of results, and a wider environment consisting of packages and other tools for the development of statistical applications. Data mining functionality is provided through a number of packages, including classification with decision trees using the rpart package, clustering with k-means using the kmeans function, and association rule mining with Apriori using the arules package.

Weka is a collection of data mining algorithms written in Java, including algorithms for classification, clustering and association rule mining, as well as for visualisation.
RapidMiner consists of both tools for developing standalone data mining applications and an environment for using RapidMiner functions from other programming languages. Weka and R algorithms may be integrated within RapidMiner. An XML-based interchange format is used to enable the interchange of data between data mining algorithms.

Mahout is an Apache project to develop data mining algorithms for the Hadoop platform. Core MapReduce algorithms for clustering and classification are provided, but the project also incorporates algorithms designed to run on single-node architectures and non-Hadoop cluster architectures.
DATA MINING PRODUCTS: ORACLE

Oracle supports data mining algorithms for use with conventional relational tables. Mining techniques supported include:

- classification: decision trees, support vector machines, and others
- clustering: k-means, and others
- association rule mining: Apriori

Predictive Model Markup Language (PMML) support is included.

In addition to the SQL and PL/SQL interfaces, until Oracle 11 a Java API was supported to allow applications to be developed which mine data. This was Oracle's implementation of JDM 1.0, introduced above.
From Oracle 12, the Java API is no longer supported. Instead, support for R has been introduced with the Oracle R Enterprise component. Oracle R Enterprise allows R to be used to perform analysis on Oracle database tables. A collection of packages supports the mapping of R data types to Oracle database objects and the transparent rewriting of R expressions into SQL expressions on those corresponding objects.

A related product is Oracle R Connector for Hadoop. This is an R package which provides an interface between a local R environment and file system and Hadoop, enabling R functions to be executed on data in memory, on the local file system, and in HDFS.
READING

P. Bradley et al., Scaling Mining Algorithms to Large Databases, CACM, 45(8), 2002.
J. Gama, A Survey on Learning From Data Streams: Current and Future Trends, Progress in Artificial Intelligence, 1(1), 2012 (sections 3.1, 3.2 optional).
J. Lin and A. Kolcz, Large-scale Machine Learning at Twitter, Proc. SIGMOD '12, May 2012 (sections 5, 6 optional).
X. Liu et al., Survey of Real-time Processing Systems for Big Data, Proc. IDEAS '14, July 2014.
A. Toshniwal et al., Storm@Twitter, Proc. SIGMOD '14, June 2014.