Integrating Data Mining with SQL Databases: OLE DB for Data Mining


Amir Netz, Surajit Chaudhuri, Usama Fayyad (footnote 1), Jeff Bernhardt
Microsoft Corporation

Abstract

The integration of data mining with traditional database systems is key to making data mining convenient and easy to deploy in real applications, and to growing its user base. In this paper we describe the new API for data mining proposed by Microsoft as extensions to the OLE DB standard. We illustrate the basic notions that motivated the API's design and describe the key components of an OLE DB for Data Mining provider. We also include usage examples and treat the problems of data representation and integration with the SQL framework. We believe this new API will go a long way in enabling deployment of data mining in enterprise data warehouses. A reference implementation of a provider is available with the recent release of the Microsoft SQL Server 2000 database system.

1. Introduction

Progress in database technology has provided the foundation that made massive data warehouses of business data ubiquitous. Increasingly, such data warehouses are built using relational database technology. Recently, there has been tremendous commercial interest in mining the information in these data warehouses. However, building mining applications on data in relational databases is nontrivial and requires significant work on the part of application builders. What is needed, therefore, is strong system support to ease the burden of developing mining applications against relational databases. In this paper, we introduce the OLE DB for Data Mining (henceforth abbreviated as OLE DB DM) Application Programming Interface (API). As we will demonstrate in the rest of this paper, if a data source supports OLE DB DM (i.e., if the data source is an OLE DB DM "provider"), then the task of building mining applications is simplified because of the native support for DM primitives in OLE DB DM.
The case for supporting such an API is best understood by reviewing the current state of the art. Traditionally, data mining tasks are performed assuming that the data is in memory. In the last decade, there has been an increased focus on scalability but research work in the community has focused on scaling analysis over data that is stored in the file system. This tradition of performing data mining outside the DBMS has led to a situation where starting a data mining project requires its own separate environment. Data is dumped or sampled out of the database, and then a series of Perl, Awk, and special purpose programs are used for data preparation. This typically results in the familiar large trail of droppings in the file system, creating an entire new data management problem outside the database. In particular, such export creates nightmares of data consistency with data in warehouses. This is ironic because database systems were created to solve exactly such data management problems. Once the data is adequately prepared, the mining algorithms are used to generate data mining models. However, past work in data mining has barely addressed the problem of how to deploy models, e.g., once a model is generated, how to store, maintain, and refresh it as data in the warehouse is updated, how to programmatically use the model to do predictions on other data sets, and how to browse models. Over the life cycle of an enterprise application, such deployment and management of models remains one of the most important tasks. The objective of proposing OLE DB DM is to alleviate problems in deployment of models and to ease the data preparation by working directly on relational data. Another motivation to introduce OLE DB DM is to enable enterprise application developers to participate in building data mining solutions. Easing the ability to develop integrated solutions is critical for the growth of data mining technology in the enterprise space. 
In enterprises, data mining needs to be perceived as a value-added component alongside traditional decision support techniques, e.g., traditional SQL as well as OLAP querying environments [3]. To ensure the success of data mining in this regard, it is important to recognize that a typical enterprise application developer is hardly an expert in statistics and pattern recognition. Rather, he or she is comfortable using traditional database APIs/tools based on SQL, OLE DB, and other well-known standards and protocols. Therefore, it is extremely important that the infrastructure for supporting data mining solutions is aligned with the traditional database development environment and with APIs for database access. Consequently, we decided to exploit the well-known OLE DB API to ensure that data mining models and operations gain the status of first-class objects in the mainstream database development environment.

Footnote 1: Author's current affiliation: digiMine, Inc.

Our proposal builds upon OLE DB [7], an application programming interface that has been gaining popular use for universal data access, not only from relational systems but also from any data source that can be viewed as a set of tables (rowsets). It is important to point out that OLE DB for DM is independent of any particular provider or software and is meant to establish a uniform API. Another feature of our proposal is that it is not specialized to any specific mining model but is structured to cater to all well-known mining models. Any party interested in using this interface is encouraged to do so by building its own provider, as with OLE DB (and ODBC). However, we should also add that one of the strengths of our proposal is that we have implemented this API as part of the recent release of Microsoft SQL Server 2000 (Analysis Server component). The proposal also incorporates feedback from a multitude of vendors in the data mining space who participated in preview meetings and calls for reviews over the past 18 months. Our proposal is not focused on any particular KDD algorithms per se, since our intent is not to propose new algorithms but to suggest a system infrastructure that makes it possible to plug in any algorithm and enable it to participate in the entire life cycle of data mining. This paper is our attempt to summarize the basic philosophy and design decisions leading to the current specification of OLE DB DM.
A complete specification may be obtained from [9]. The rest of the paper is organized as follows. We first introduce the overall architecture and design philosophy of OLE DB for Data Mining (OLE DB DM) in Section 2. In Section 3, we discuss the primitives and basic building blocks in the API, with illustrative examples of their usage. Section 4 presents related work and we conclude in Section 5.

2. Overview and Design Philosophy

OLE DB DM provides a set of extensions to OLE DB to make it mining-enabled. OLE DB is an object-oriented specification for a set of data access interfaces designed for record-oriented data stores. In particular, by virtue of supporting command strings, OLE DB can provide full support for SQL. In this paper, we primarily introduce extensions that leverage this aspect of the OLE DB API to make sure that support for mining concepts is built in. Figure 1 shows how the Microsoft SQL Server database management system exposes OLE DB DM. In this specific implementation, the core relational engine exposes the traditional OLE DB interface, while the analysis server component exposes an OLE DB DM interface (along with OLE DB for OLAP) so that data mining solutions may be built on top of the analysis server. Of course, this is only one example of how OLE DB DM may be exposed. For example, in the future, data mining support may be deeply integrated in the core relational database system.

Our approach in defining OLE DB DM is to maintain the SQL metaphor and to reuse existing notions whenever possible. Due to the widespread use of SQL in the developer community, we decided to expose data mining interfaces in a language-based API. This decision reflects a central design principle: building data mining solutions (applications) should be easy.
Thus, our design philosophy requires relational database management systems to invest resources in providing the drivers that support OLE DB DM, so that data mining solutions may be developed with ease by data mining application developers. Since there are likely to be many more application developers than DBMS vendors, such a design philosophy inflicts the least pain. The key challenge is to support the basic operations of data mining without introducing much change in the programming model and environment that a database developer is used to. Fundamentally, this requires supporting operations on data mining models. The key operations to support on a data mining model are the abilities to:

1. Define a mining model, i.e., identify the set of attributes of data to be predicted, the set of attributes to be used for prediction, and the algorithm used to build the mining model.
2. Populate a mining model from training data using the algorithm specified in (1).
3. Predict attributes for new data using a mining model that has been populated.
4. Browse a mining model for reporting and visualization applications.

In developing OLE DB DM, we have decided to represent a data mining model (henceforth model or DMM) as analogous to a table in SQL (footnote 2). Thus, it can be defined via the CREATE statement (used to create a table in SQL). It can be populated, possibly repeatedly, via the INSERT INTO statement (used to add rows to a SQL table), and it can be emptied (reset) via the DELETE statement. A mining model may be used to make predictions on datasets. This is modeled through a prediction join between a mining model and a data set, just as the traditional join operation is used to pull together information from multiple tables. Finally, the content of a mining model can be browsed via SELECT (used to identify a subset of data in SQL).

Footnote 2: Although viewed conceptually and programmatically as tables, models have important differences from tables. A data mining model is populated by consuming a rowset, but its own internal structure can be more abstract, e.g., a decision tree is a tree-like structure, much more compact than the training data set used to create it.

Since the mining model is a native named object at the server, it can participate in interactions with other objects using the primitives listed above (see also Section 3). Note that this achieves the desirable goal of avoiding excessive data movement, extraction, and copying, and thus results in better performance and manageability compared to traditional mining outside the database server framework. Finally, once the DMM is created and optimized, deployment within the enterprise becomes as easy as writing SQL queries. This significantly reduces the cost of deployment by eliminating the need to develop a proprietary infrastructure and interface to deploy the data mining solution. Model deployment and maintenance costs have received little attention in the literature, though deployment and maintenance are expensive and difficult tasks in real applications. In the next section, we review the details of the language extensions that provide an extensible framework for incorporating support for data mining.

3. Basic Components of OLE DB DM

OLE DB DM has only two notions beyond the traditional OLE DB definition: cases and models. Input data is in the form of a set of cases (caseset). A case captures the traditional view of an observation by machine learning algorithms as consisting of all information known about a basic entity being analyzed for mining. Of course, structurally, a set of cases is no different from a table (footnote 3).
Cases are explained further in Section 3.1.

A data mining model (DMM) is treated as if it were a special type of table: (a) We associate a caseset and additional meta-information with a DMM while creating (defining) it. (b) When data (in the form of cases) is inserted into the data mining model, a mining algorithm processes it and the resulting abstraction (or DMM) is saved instead of the data itself. Once a DMM is populated, it can be used for prediction, or its content can be browsed for reporting and visualization purposes. Fundamental operations on models include CREATE, INSERT INTO, PREDICTION JOIN, SELECT, DELETE FROM, and DROP.

Footnote 3 (on a set of cases being structurally no different from a table): However, like many other data sets, cases are more naturally represented as hierarchical data, i.e., nested tables are more natural.

It should be noted that OLE DB DM has all the traditional OLE DB interfaces. In particular, schema rowsets specify the capabilities of an OLE DB DM provider. This is the standard mechanism in OLE DB whereby a provider describes information about itself to potential consumers. This information describes the supported capabilities (e.g., prediction, segmentation, sequence analysis), the types of data distributions supported, limitations of the provider (size or time limitations), support for incremental model maintenance, etc. Other schema rowsets provide metadata on the columns of a mining model, on its contents, and on the supported services. The details of these schema rowsets are beyond the scope of this paper.

The rest of this section is organized as follows. In Section 3.1, we describe how input data is modeled as a caseset. In Section 3.2, we describe how data mining models may be defined (created) in a relational framework and the kind of meta-information on casesets that mining needs to be aware of.
Section 3.3 describes operations on mining models, notably insertion (populating the model), prediction, and browsing.

3.1 Data as Cases

To ensure that a data mining model is accurate and meaningful, data mining algorithms require that all information related to the entity to be mined be consolidated before invoking the algorithms. Traditionally, mining algorithms also view the input as a single table that represents such consolidated information, where each row represents an instance of an entity. Data mining algorithms are designed so that they consume one entity instance at a time. Having all the information related to a single entity instance together in a rowset has two important benefits. First, traditional data mining algorithms can be leveraged with relative ease. Second, it increases scalability, as it eliminates the need for data mining algorithms to do considerable bookkeeping.

In a relational database, data is often normalized and hence the information related to an entity is scattered across multiple tables. Therefore, in order to use data mining, a key step is to pull the information related to an entity into a single rowset using views. Furthermore, as is well recognized, data is often hierarchical in nature, so ideally we require the view definition language to be powerful enough to produce a hierarchical rowset. This collection of data pertaining to a single entity is called a case, and the set of all relevant cases is referred to as a caseset. As mentioned above, a caseset may need to be hierarchical. In the rest of this section, we elaborate on these two issues through examples and further discussion.

Consider a schema consisting of 3 tables representing customers. The Customers table has some customer properties. The Product Purchases table stores, for each customer, the products they bought. The Car Ownership table lists the cars owned by each customer (footnote 4). Consider a customer, say Customer ID 1, who happens to be male, has black hair, and is believed to be 35 years old with 100% certainty. Assume this customer has bought a TV, a VCR, Beer (quantity 6), and Ham (quantity 2). Further assume we know this customer owns a truck (100%) and we believe he also has a van (50% certainty). Suppose one were to issue a join query over the 3 tables to get all the information we know about the set of customers. Readers familiar with SQL will quickly realize that, for this customer alone, the join returns 8 rows (1 customer row x 4 purchase rows x 2 car rows). Furthermore, these rows contain a great deal of replication. More importantly, since the information about an entity instance is scattered among multiple rows, the quality of output from data mining algorithms is negatively impacted by such a flattened representation. Table 1 shows how a nested table can succinctly describe the above join result.

Table 1: A nested table representation of the 3-way join result

Customer ID | Gender | Hair Color | Age | Age Prob. | Product Purchases            | Car_Ownership
            |        |            |     |           | Name | Quantity | Type       | Car   | Prob.
1           | Male   | Black      | 35  | 100%      | TV   | 1        | Electronic | Truck | 100%
            |        |            |     |           | VCR  | 1        | Electronic | Van   | 50%
            |        |            |     |           | Ham  | 2        | Food       |       |
            |        |            |     |           | Beer | 6        | Beverage   |       |

As illustrated in Table 1, a customer case can include not only simple columns but also multiple tables (nested tables as columns). Each of these tables inside the case can have a variable number of rows and a different number of columns. Some of the columns in the example have a direct one-to-one relationship with the case (such as Gender and Age), while others have a one-to-many relationship with the case and therefore exist in nested tables.
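The contrast between the flattened join and the nested case can be made concrete with a small, self-contained sketch. It uses Python's standard sqlite3 module purely as a stand-in relational engine; the table and column names (CustomerID, CustID, AgeProb, and so on) are illustrative assumptions, not names taken from the OLE DB DM specification, and the nesting is done by hand rather than by the Data Shaping Service.

```python
import sqlite3


def build_demo_db():
    """Create the three sample tables in an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Gender TEXT,
                                HairColor TEXT, Age REAL, AgeProb REAL);
        CREATE TABLE ProductPurchases (CustID INTEGER, ProductName TEXT,
                                       Quantity REAL, ProductType TEXT);
        CREATE TABLE CarOwnership (CustID INTEGER, Car TEXT, CarProb REAL);
        INSERT INTO Customers VALUES (1, 'Male', 'Black', 35, 1.0);
        INSERT INTO ProductPurchases VALUES
            (1, 'TV', 1, 'Electronic'), (1, 'VCR', 1, 'Electronic'),
            (1, 'Ham', 2, 'Food'), (1, 'Beer', 6, 'Beverage');
        INSERT INTO CarOwnership VALUES (1, 'Truck', 1.0), (1, 'Van', 0.5);
    """)
    return con


def flat_join(con):
    """Flattened 3-way join: one row per (purchase, car) pair per customer."""
    return con.execute("""
        SELECT c.CustomerID, c.Gender, c.HairColor, c.Age, c.AgeProb,
               p.ProductName, p.Quantity, p.ProductType, o.Car, o.CarProb
        FROM Customers c
        JOIN ProductPurchases p ON p.CustID = c.CustomerID
        JOIN CarOwnership o ON o.CustID = c.CustomerID
        ORDER BY c.CustomerID, p.ProductName, o.Car
    """).fetchall()


def nested_cases(con):
    """Build one case (dict) per customer, keeping child rows as nested lists."""
    cases = []
    customers = con.execute(
        "SELECT CustomerID, Gender, HairColor, Age, AgeProb "
        "FROM Customers ORDER BY CustomerID").fetchall()
    for cid, gender, hair, age, age_prob in customers:
        purchases = con.execute(
            "SELECT ProductName, Quantity, ProductType FROM ProductPurchases "
            "WHERE CustID = ? ORDER BY ProductName", (cid,)).fetchall()
        cars = con.execute(
            "SELECT Car, CarProb FROM CarOwnership "
            "WHERE CustID = ? ORDER BY Car", (cid,)).fetchall()
        cases.append({
            "Customer ID": cid, "Gender": gender, "Hair Color": hair,
            "Age": age, "Age Prob.": age_prob,
            "Product Purchases": purchases,   # nested table
            "Car_Ownership": cars,            # nested table
        })
    return cases


con = build_demo_db()
flat = flat_join(con)      # 4 purchases x 2 cars for the single customer
cases = nested_cases(con)  # a single case holding both nested tables
print(len(flat), "flattened rows;", len(cases), "nested case(s)")
```

In the flattened result every customer attribute is repeated on each row, whereas the nested form stores it once and keeps the two one-to-many relationships separate.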
The reader can easily identify the two nested tables contained in the sample case in Table 1:
- The Product Purchases table, containing the columns Product Name, Product Quantity, and Product Type.
- The Car Ownership table, containing the columns Cars and Car Probability.

The nested table is a familiar notion in the relational database world, though not adopted universally in commercial systems. In OLE DB DM, we have chosen to use the Data Shaping Service (footnote 5) to generate a hierarchical rowset. It should be emphasized that this is a logical definition and does not have any implications for data storage. In other words, it is not necessary for the storage subsystem to support nested records. Cases are only instantiated as nested rowsets prior to training/predicting data mining models. Note that the same physical data may be used to generate different casesets for different analysis purposes. For example, if one chooses to mine models over all products, each product then becomes a single case and customers would need to be represented as columns of the case.

Once a hierarchical caseset has been prepared using the traditional view definition mechanism and the Data Shaping Service in OLE DB, we can use this caseset to populate a data mining model, or we can predict values of unknown columns using a pre-existing model. In order to do that, we need a way to represent data mining models. We illustrate this in the next section.

3.2 Creating/Defining Data Mining Models

In OLE DB DM, a data mining model (DMM) is treated as a first-class object of the provider, much like a table. In this section, our focus is on the definition (creation) of data mining models. In particular, the DMM definition includes a description of the columns of data over which the model will be created, including meta-information on relationships between columns. It is assumed that a data set is represented as a nested table as discussed in Section 3.1.
In Section 3.3, we will describe how other operations on data mining models may be supported in OLE DB DM.

Footnote 4: The columns Age Probability and Car_Ownership.Probability are unusual in that they represent degree of certainty.
Footnote 5: The Data Shaping Service is part of the Microsoft Data Access Components (MDAC) products.

As discussed earlier, a data mining model is trained over a single caseset and is used to predict columns of a caseset. Therefore, defining (creating) a data mining model requires us to specify:
- The name of the model.
- The algorithm (and parameters) used to build the model.
- The algorithm for prediction using the model.
- The columns the caseset must have and the relationships among these columns.
- The columns to be used as source columns and the columns to be filled by the mining model ("prediction columns").

The following example illustrates the syntax:

    CREATE MINING MODEL [Age Prediction] (        %Name of Model
        [Customer ID] LONG KEY,
        [Gender] TEXT DISCRETE,
        [Age] DOUBLE DISCRETIZED PREDICT,          %prediction column
        [Product Purchases] TABLE(
            [Product Name] TEXT KEY,
            [Quantity] DOUBLE NORMAL CONTINUOUS,
            [Product Type] TEXT DISCRETE RELATED TO [Product Name]))
    USING [Decision_Trees_101]                     %Mining Algorithm used

The annotations in the example identify the source columns used for prediction, the column to be predicted (Age), and the mining algorithm used. Specifically, the DMM is named [Age Prediction] (the square brackets [ ] are name delimiters by convention). The PREDICT keyword indicates that the Age column is to be predicted. [Product Purchases] is a nested table, with [Product Name] being its key. The USING clause specifies the mining algorithm to use. Parameters to the algorithm could be included as part of this clause. When we contrast this example with the CREATE TABLE syntax in SQL, a key difference is the substantially richer meta-information that adorns each column reference.

The model columns describe all of the information about a specific case. For example, assume that each case in the DMM represents a customer. The columns of the DMM will include all known and desired information about the customer (e.g., Table 1). Several kinds of information may be specified for columns:
- Information about the role of a column (Section 3.2.1): Is it a key? Does it contain an attribute? How does it relate to other columns or nested tables? Is it a qualifier on other columns?
- For columns that represent attributes, what type of attribute is it?
Attribute types are described in Section 3.2.2.
- Additional information about data distributions that can be provided as hints to the DMM (Section 3.2.3).

The above details are discussed in the rest of this subsection, and we conclude this section with a comprehensive example. Finally, note that despite the similarities with CREATE TABLE with respect to programmability, mining models are conceptually different, as they represent only summarized information for a data set. Also, unlike traditional data tables, they are capable of prediction (see Section 3.3).

3.2.1 Content Types of Columns: Column Specifiers

Each column in a case can represent one of the following content types:
- KEY: The columns that identify a row. For example, Customer ID uniquely identifies customer cases, and Product Name uniquely identifies a row in the Product Purchases table.
- ATTRIBUTE: A direct attribute of the case. This type of column represents some value for the case. Example attributes are the age, gender, or hair color of the customer, or the quantity of a specific product the customer purchased.
- RELATION: Information used to classify attributes, other relations, or key columns. For example, Product Type classifies Product Name. A given relation value must always be consistent for all of the instance values of the other columns it describes; for example, the product Ham must always be shown as Food for all cases. In the CREATE MINING MODEL command syntax, relations are identified in the column definition by using a RELATED TO clause to indicate the column being classified.
- QUALIFIER: A special value associated with an attribute that has a predefined meaning for the provider, for example the probability that the attribute value is correct. These qualifiers are all optional and apply only if the data has uncertainties attached to it or if the output of previous predictions is being chained as input to a subsequent DMM training step.
In the CREATE MINING MODEL command syntax, qualifiers are identified by using an OF clause to indicate the attribute column they modify. There are many qualifiers. We give a few examples here:
(a) PROBABILITY: The probability, in [0,1], of the associated value.
(b) VARIANCE: A number that describes the variance of the value of an attribute.
(c) SUPPORT: A float that represents a weight (case replication factor) to be associated with the value.
(d) PROBABILITY_VARIANCE: The variance associated with the probability estimator used for PROBABILITY.
(e) ORDER: Specifies the order of a column. (See ORDERED below.)
(f) TABLE: A nested table in the case consists of a special column with the data type TABLE. For any given case row, the value of a TABLE type column contains the entire contents of the associated nested table. The value of a TABLE type column is in itself a table containing all of the columns of the nested table. In the CREATE MINING MODEL command syntax, nested tables are described by a set of columns, all of which are contained within the definition of a named TABLE type column.

3.2.2 Content Types of Columns: Attribute Types

For a column that is specified as an attribute, it is possible to further indicate the type of attribute. The following keywords specify the attribute type to the DM provider:
- DISCRETE: The attribute values are discrete (categorical) with no ordering implied by the values (often called states). Area Code is a good example.
- ORDERED: Columns that define a set of values with a total ordering, but with no distance or magnitude semantics implied. A ranking of skill level (say one through five) is an ordered set, but a skill level of five isn't necessarily five times better than a skill level of one.
- CYCLICAL: A set of values that have cyclical ordering. Day of the week is a good example, since day number one follows day number seven.
- CONTINUOUS: Attributes with numeric values (integer or float). Values are naturally ordered and have implicit distance and magnitude semantics. Salary is a typical example.
- DISCRETIZED: The data that will be inserted into the model is continuous, but it should be transformed into and modeled as a number of ORDERED states by the provider. Some data mining algorithms cannot accept CONTINUOUS attributes as input, or they may not be able to predict CONTINUOUS values.
- SEQUENCE_TIME: A time measurement range. The format is not restricted, e.g., a period number is acceptable.
This is typically used to associate a sequence time with individual attribute values such as purchase time.

3.2.3 Distribution Hints

An attribute, continuous or discrete, may have an associated distribution. These distributions are used as hints to the DMM and can specify prior knowledge about the data. For example, a continuous attribute can be normal (Gaussian), log normal, or uniform. A discrete attribute can be binomial, multinomial, Poisson, or mixture. Other hints could include Not_Null, which indicates that there should never be a null value, and model existence only, which means that the information of interest is not in the value of an attribute but in the fact that a value is present. Hence Age could be modeled as binary, where the significant information is whether the age is known or not.

3.2.4 Prediction Columns

Output columns are columns in a dataset that the DMM is capable of predicting. Because these columns will be predicted over cases in which their values are missing or unspecified, additional statistical information may be associated with such predictions. We discuss this additional information here. A later section presents the syntax of such prediction columns.

Attribute or Table type columns can be input columns, output columns, or both. The data mining provider builds a DMM capable of predicting or explaining output column values based on the values of the input columns. Predictions may convey not only simple information such as "estimated age is 21" but also additional statistical information associated with the prediction, such as confidence level and standard deviation. Further, the prediction may actually be a collection of predictions, such as the set of products that the customer is likely to buy. Each of the predictions in the collection may also include a set of statistics. A prediction can be expressed as a histogram. A histogram provides multiple possible prediction values, each accompanied by a probability and other statistics.
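As a rough illustration of a histogram-valued prediction, the following sketch models each histogram entry as a value with a probability and a support count, and shows three ways a consumer might extract part of it (best estimate, top n, or entries above a probability threshold). The class and function names (Bucket, best_estimate, top_n, above) and the sample numbers are hypothetical; they are not the transformation functions defined by the OLE DB DM specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    value: str          # predicted state, e.g. a discretized age range
    probability: float  # estimated probability of this state
    support: float      # number of training cases backing the estimate


def best_estimate(histogram):
    """Return the single most probable entry."""
    return max(histogram, key=lambda b: b.probability)


def top_n(histogram, n):
    """Return the n most probable entries, most probable first."""
    return sorted(histogram, key=lambda b: b.probability, reverse=True)[:n]


def above(histogram, threshold):
    """Return the entries whose probability exceeds the threshold."""
    return [b for b in histogram if b.probability > threshold]


# Made-up histogram for a predicted, discretized Age column.
age_histogram = [
    Bucket("20-29", 0.60, 120.0),
    Bucket("30-39", 0.25, 50.0),
    Bucket("40-49", 0.10, 20.0),
    Bucket("50+", 0.05, 10.0),
]

print(best_estimate(age_histogram).value)
print([b.value for b in top_n(age_histogram, 3)])
print([b.value for b in above(age_histogram, 0.55)])
```

Treating the histogram as a small nested table of rows makes each extraction an ordinary filter or sort over that table.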
When histogram information is required, each prediction (which by itself can be part of a collection of predictions) may have a collection of possible values that constitutes a histogram. Since there is a wealth of information we can associate with predictions, it is often necessary to extract only a portion of the predicted information. For example, a user may want to see only the best estimate, the top 3 estimates, or estimates with probability greater than 55%. Not every provider or DMM can support all of the possible requests. Therefore, it is necessary for the output column to define what information may be extracted from it. OLE DB DM defines a set of standard transformation functions on output columns. These functions are discussed in detail in the OLE DB DM specification document [9]. The basic mechanism we used to accommodate this flexibility in output is to draw on the familiar notion of user-defined functions (UDFs) used in OLAP [10,15]. Each provider ships a set of functions that can be referenced in the prediction query. Some UDFs are scalar-valued, such as probability or support. Others have tables as values, such as histogram, and hence return nested tables when invoked.

3.2.5 Detailed Example of DMM Column Specification

Now that we have described the column specifications, it may be useful to provide an example. Column descriptions allow the provider to better model the training data it is given. We now return to our running example and indicate how the content types of some of the columns in the example can be classified:
(a) Gender: discrete attribute
(b) Age: continuous attribute
(c) Age Probability: probability modifier of age
(d) Customer Loyalty: ordered attribute
(e) Product Name: key (for the Product Purchases table)
(f) Product Type: relation of Product Name
The example above includes a few additional columns to illustrate a few more content types.

3.3 Operations on Data Model

3.3.1 Populating a Mining Model: Insert Into

Once a mining model is defined, the next step is to populate it by consuming a caseset that satisfies the specification in the CREATE MINING MODEL statement. In OLE DB DM, we use INSERT to model instantiation of the mining model. Unlike with a traditional table, insertion does not add the tuple to the rowset. Rather, the insertion corresponds to the data mining model consuming the observation represented by a case. The following example illustrates the syntax. Note the use of the SHAPE statement to create the nested table from the input data:

    INSERT INTO [Age Prediction]
        ([Customer ID], [Gender], [Age],
         [Product Purchases]([Product Name], [Quantity], [Product Type]))
    SHAPE {SELECT [Customer ID], [Gender], [Age]
           FROM Customers ORDER BY [Customer ID]}
    APPEND ({SELECT [CustID], [Product Name], [Quantity], [Product Type]
             FROM Sales ORDER BY [CustID]}
            RELATE [Customer ID] TO [CustID]) AS [Product Purchases]

3.3.2 Using Data Model to Predict: Prediction Join

We now explain how predictions can be obtained from a populated model in OLE DB DM. The basic operation of obtaining predictions on a dataset D using a DMM M is modeled as a prediction join between D and M. Of course, the caseset needs to match the schema of the DMM.
Since the model does not actually contain data details, the semantics of the prediction join are different from those of an equi-join on tables. The following example illustrates how a prediction join may be used. In this example, the value for Age is not known and is being predicted using the prediction join between a data mining model and the given (test) data set. A prediction join between the test cases and the DMM, selecting the Age column from the DMM, returns a predicted Age for each of the test cases. We need a dedicated prediction join because the process of combining actual cases with all possible model cases is not as simple as the semantics of a normal SQL equi-join. When the schema of the actual case table matches the schema of the model, NATURAL PREDICTION JOIN can be used, obviating the need for the ON clause of the join. Columns from the source query will then be matched to columns of the DMM based on the names of the columns.

    SELECT t.[Customer ID], [Age Prediction].[Age]
    FROM [Age Prediction]
    PREDICTION JOIN
        (SHAPE {SELECT [Customer ID], [Gender]
                FROM Customers ORDER BY [Customer ID]}
         APPEND ({SELECT [CustID], [Product Name], [Quantity]
                  FROM Sales ORDER BY [CustID]}
                 RELATE [Customer ID] TO [CustID]) AS [Product Purchases]) AS t
    ON  [Age Prediction].[Gender] = t.[Gender]
    AND [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name]
    AND [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]

The DMM content can be thought of as a truth table containing a row for every possible combination of the distinct values for each column in the DMM. In other words, it contains every possible case along with every possible associated set of output columns. With this view in mind, a DMM can be used to look up predicted values and statistics by using the attribute values of a case as a key for the join.
It is important to point out that this analogy is not quite accurate, as the set of all possible cases (combinations of attribute values) is huge, and possibly infinite when one considers continuous attributes. However, this logical view allows us to map prediction into a familiar basic operation in the relational world.

When a model is joined with a table, predicted values are obtained for each case that matches the model (i.e., each case for which the model is capable of making a prediction). The SELECT statement associated with the PREDICTION JOIN specifies which predicted values should be extracted and returned in the result set of the query. User-defined functions (UDFs) in the projection columns are used to further enhance the set of predicted values with additional statistical information. We have discussed this aspect in Section
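The logical view of a prediction join as a keyed lookup can be sketched in a few lines of Python. This is only an illustration of the semantics described above, under strong simplifying assumptions: the class ToyAgeModel and its methods are hypothetical, the "model" is a table of per-key averages rather than a real mining algorithm, and the join key is reduced to Gender plus the set of purchased product names. It shows two points from the text: populating the model consumes cases without storing them, and only cases that match the model receive a prediction.

```python
from collections import defaultdict


class ToyAgeModel:
    """Hypothetical stand-in for a DMM: keeps per-key statistics, not the cases."""

    def __init__(self):
        self._age_sum = defaultdict(float)
        self._count = defaultdict(int)

    @staticmethod
    def _key(case):
        # Assumed join key: Gender plus the set of purchased product names.
        products = frozenset(p["Product Name"] for p in case["Product Purchases"])
        return (case["Gender"], products)

    def insert(self, cases):
        """Analogue of INSERT INTO: consume each case and update statistics only."""
        for case in cases:
            key = self._key(case)
            self._age_sum[key] += case["Age"]
            self._count[key] += 1

    def prediction_join(self, cases):
        """Return (Customer ID, predicted Age) for every case the model can match."""
        results = []
        for case in cases:
            key = self._key(case)
            if key in self._count:  # cases with an unseen key yield no row
                predicted = self._age_sum[key] / self._count[key]
                results.append((case["Customer ID"], predicted))
        return results
```

A real provider would generalize to unseen attribute combinations (for example through a decision tree) instead of dropping them; the lookup table is used here only because it matches the truth-table analogy in the text.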

Browsing Model Content: Select

In addition to listing the column structure of a DMM, a different type of browsing is to navigate the content of the model viewed as a graph. Using a set of input cases, the content of a DMM is learned by the data mining algorithm. The content of a DMM is the set of rules, formulas, classifications, distributions, nodes, or any other information that was derived from a specific set of data using a data mining technique. The content type varies according to the specific data mining technique used to create the DMM. The DMM content of a decision-tree-based classification model will differ from that of a segmentation model, which, in turn, differs from that of a multi-regression DMM.

The most popular way to express DMM content is by viewing it as a directed graph (that is, a set of nodes with connecting edges). Note that decision trees fit such a view nicely. Each node in the tree may have relationships to other nodes. A node may have one or more parent nodes and zero or more child nodes. The depth of the graph may vary depending on the specific node. The tree navigation we adopted is similar to the mechanism already defined in the OLE DB for OLAP specification [10]. On querying (SELECT * FROM <mining model>.content), the content of a mining model is exposed through the MINING_MODEL_CONTENT schema rowset. Unfortunately, a detailed discussion of the MINING_MODEL_CONTENT schema rowset is beyond the scope of this paper (see Appendix A of [9]).

4. Related Work

The focus of OLE DB DM is not on new KDD algorithms. Therefore, we do not discuss here the wide body of literature on KDD algorithms and scalable techniques for such algorithms. Compared to the wide body of work on KDD algorithms, as well as the more recent work on scalable algorithms, there has been much less work on the problem of integrating data mining with relational databases. The work by Sarawagi et al.
[13] addresses the problem of how to implement the derivation of association rules efficiently using SQL. Although significant for its performance implications, this direction of work is orthogonal to the issue of how to enable, deploy, and expose data mining models and algorithms as first-class objects in the database application API. A similar effort relating to generalized association rules and sequences is also focused on implementing particular algorithms in SQL [14].

Research projects such as Quest [1] and DBMiner [6], as well as commercial mining tools, provide application interfaces and UI to support mining and allow access to data in the warehouse. However, such tools do not provide the ability to deal with arbitrary mining models, nor do they integrate the mining application interfaces with SQL and with relational data access APIs (ODBC, OLE DB).

The paper by Meo et al. [8] is an example of specialized extensions to SQL to provide support for association rules. Such an approach is narrowly focused and does not provide an extensible framework for supporting models derived from other mining techniques, e.g., clustering and classification. Moreover, no support is provided for management of mining models and for using such models subsequently in other contexts.

More recently, CRISP-DM [4] has been proposed as a standard data mining process model. This initiative is complementary to OLE DB DM. CRISP-DM provides a set of methodologies and best practices, and attempts to define the various activities involved in the KDD process. It does not address issues of integration with the database system.

A related effort, called Predictive Model Markup Language (PMML), provides an open standard for how models should be persisted in XML. The basic idea of PMML is to enable portability and model sharing [5]. PMML does not address the database integration issue but specifies a persistence format for models.
We are currently working with the PMML group to use the PMML format as an open persistence format. In fact, in the model browsing methods, briefly discussed in Section 3.3.3, we use PMML-inspired XML strings to expose the content of data mining model nodes.

As mentioned earlier, OLE DB DM is an extension to OLE DB, a specification of data access interfaces. OLE DB generalized the widely used ODBC connectivity standard by enabling access to record-oriented data stores that are not necessarily SQL data sources.

5. Concluding Remarks

We presented the highlights of OLE DB DM, a new API for integrating data mining into SQL databases. A reference OLE DB DM provider is scheduled to ship with the next release of a commercial database system: Microsoft SQL Server 2000. The target user base of OLE DB DM is the community of enterprise developers who are familiar with SQL and database connectivity APIs such as ODBC and OLE DB. For this reason, a key design goal was to keep the set of extensions as close to SQL's look and feel as possible. OLE DB DM introduces data mining objects and entities by drawing straightforward analogies to familiar relational objects. We have only introduced two new concepts: Data Mining Models (DMM) and Cases.

We hope that OLE DB DM will help steer our community to establish a standard in the field. The advantages of such a standard include integration with the database, so that data mining no longer needs to involve taking data outside the database; addressing issues of model deployment once models are created; and providing an infrastructure for managing data and models while leveraging the existing SQL infrastructure. Establishing standards also reduces costs, increases the ability to interoperate and partner, and reduces the risks for both providers and consumers. Consumers are protected since they are no longer dependent on a proprietary interface, so other vendors can step in to provide replacements and enhancements. Providers are protected from the need to build infrastructure from scratch, as partners can step in to build different complementary components. To move towards a standard, an open forum such as a standards committee elected by the community will be needed. We hope that this proposal will encourage further discussions along this direction.

Acknowledgement: We are grateful to Venkatesh Ganti for his thoughtful comments on the draft.

6. References

[1] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant: The Quest data mining system. Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, August.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo: Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA.
[3] S. Chaudhuri and U. Dayal: An overview of data warehousing and OLAP technology. SIGMOD Record 26(1).
[4] P. Chapman, R. Kerber, J. Clinton, T. Khabaza, T. Reinartz, R. Wirth: The CRISP-DM process model. Technical Report.
[5] DMG Organization: PMML 1.0 - Predictive Model Markup Language, 1_0.html.
[6] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, O. R. Zaiane: DBMiner: A system for mining knowledge in large relational databases. Proc. of Int'l Conf. on Data Mining and Knowledge Discovery (KDD'96), Portland, Oregon, August.
[7] P. Hipson: OLE DB and ADO Developer's Guide. McGraw-Hill.
[8] R. Meo, G. Psaila, S. Ceri: An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, Vol. 2, Number 2.
[9] Introduction to OLE DB for Data Mining, Microsoft Corporation.
[10] OLE DB for OLAP, Microsoft Corporation.
[11] R. Ramakrishnan and J. Gehrke: Principles of Database Management (2nd Edition).
[12] J. Shanmugasundaram, U. M. Fayyad and P. S. Bradley: Compressed data cubes for OLAP aggregate query approximation on continuous dimensions. In Proc. 5th Int'l Conf. on Knowledge Discovery and Data Mining.
[13] S. Sarawagi, S. Thomas, R. Agrawal: Integrating mining with relational database systems: alternatives and implications. Proc. of SIGMOD Conference.
[14] S. Thomas, S. Sarawagi: Mining generalized association rules and sequential patterns using SQL queries. Proc. of the 4th Int'l Conference on Knowledge Discovery in Databases and Data Mining, New York, August.
[15] E. Thomsen, G. Spofford, D. Chase: Microsoft OLAP Solutions. John Wiley & Sons.

Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

More information

The basic data mining algorithms introduced may be enhanced in a number of ways.

The basic data mining algorithms introduced may be enhanced in a number of ways. DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,

More information

Efficient Integration of Data Mining Techniques in Database Management Systems

Efficient Integration of Data Mining Techniques in Database Management Systems Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France

More information

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria

More information

Mining various patterns in sequential data in an SQL-like manner *

Mining various patterns in sequential data in an SQL-like manner * Mining various patterns in sequential data in an SQL-like manner * Marek Wojciechowski Poznan University of Technology, Institute of Computing Science, ul. Piotrowo 3a, 60-965 Poznan, Poland Marek.Wojciechowski@cs.put.poznan.pl

More information

Data Mining Extensions (DMX) Reference

Data Mining Extensions (DMX) Reference Data Mining Extensions (DMX) Reference SQL Server 2012 Books Online Summary: Data Mining Extensions (DMX) is a language that you can use to create and work with data mining models in Microsoft SQL Server

More information

Building Data Mining Solutions with OLE DB for DM and XML for Analysis

Building Data Mining Solutions with OLE DB for DM and XML for Analysis Building Data Mining Solutions with OLE DB for DM and XML for Analysis Zhaohui Tang, Jamie Maclennan, Peter Pyungchul Kim Microsoft SQL Server Data Mining One, Microsoft Way Redmond, WA 98007 {ZhaoTang,

More information

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

1 File Processing Systems

1 File Processing Systems COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.

More information

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc. Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

More information

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process ORACLE OLAP KEY FEATURES AND BENEFITS FAST ANSWERS TO TOUGH QUESTIONS EASILY KEY FEATURES & BENEFITS World class analytic engine Superior query performance Simple SQL access to advanced analytics Enhanced

More information

Integrating Pattern Mining in Relational Databases

Integrating Pattern Mining in Relational Databases Integrating Pattern Mining in Relational Databases Toon Calders, Bart Goethals, and Adriana Prado University of Antwerp, Belgium {toon.calders, bart.goethals, adriana.prado}@ua.ac.be Abstract. Almost a

More information

Data Mining Standards

Data Mining Standards Data Mining Standards Arati Kadav Jaya Kawale Pabitra Mitra aratik@cse.iitk.ac.in jayak@cse.iitk.ac.in pmitra@cse.iitk.ac.in Abstract In this survey paper we have consolidated all the current data mining

More information

DATA WAREHOUSING AND OLAP TECHNOLOGY

DATA WAREHOUSING AND OLAP TECHNOLOGY DATA WAREHOUSING AND OLAP TECHNOLOGY Manya Sethi MCA Final Year Amity University, Uttar Pradesh Under Guidance of Ms. Shruti Nagpal Abstract DATA WAREHOUSING and Online Analytical Processing (OLAP) are

More information

BUILDING OLAP TOOLS OVER LARGE DATABASES

BUILDING OLAP TOOLS OVER LARGE DATABASES BUILDING OLAP TOOLS OVER LARGE DATABASES Rui Oliveira, Jorge Bernardino ISEC Instituto Superior de Engenharia de Coimbra, Polytechnic Institute of Coimbra Quinta da Nora, Rua Pedro Nunes, P-3030-199 Coimbra,

More information

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS Mrs. Jyoti Nawade 1, Dr. Balaji D 2, Mr. Pravin Nawade 3 1 Lecturer, JSPM S Bhivrabai Sawant Polytechnic, Pune (India) 2 Assistant

More information

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner 24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Release 2.1 of SAS Add-In for Microsoft Office Bringing Microsoft PowerPoint into the Mix ABSTRACT INTRODUCTION Data Access

Release 2.1 of SAS Add-In for Microsoft Office Bringing Microsoft PowerPoint into the Mix ABSTRACT INTRODUCTION Data Access Release 2.1 of SAS Add-In for Microsoft Office Bringing Microsoft PowerPoint into the Mix Jennifer Clegg, SAS Institute Inc., Cary, NC Eric Hill, SAS Institute Inc., Cary, NC ABSTRACT Release 2.1 of SAS

More information

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24 Data Federation Administration Tool Guide Content 1 What's new in the.... 5 2 Introduction to administration

More information

SQL Server An Overview

SQL Server An Overview SQL Server An Overview SQL Server Microsoft SQL Server is designed to work effectively in a number of environments: As a two-tier or multi-tier client/server database system As a desktop database system

More information

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis

Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Query Optimization Approach in SQL to prepare Data Sets for Data Mining Analysis Rajesh Reddy Muley 1, Sravani Achanta 2, Prof.S.V.Achutha Rao 3 1 pursuing M.Tech(CSE), Vikas College of Engineering and

More information

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems

A Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems Proceedings of the Postgraduate Annual Research Seminar 2005 68 A Model-based Software Architecture for XML and Metadata Integration in Warehouse Systems Abstract Wan Mohd Haffiz Mohd Nasir, Shamsul Sahibuddin

More information

Metadata Management for Data Warehouse Projects

Metadata Management for Data Warehouse Projects Metadata Management for Data Warehouse Projects Stefano Cazzella Datamat S.p.A. stefano.cazzella@datamat.it Abstract Metadata management has been identified as one of the major critical success factor

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. Chapter 23, Part A Data Warehousing and Decision Support Chapter 23, Part A Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke 1 Introduction Increasingly, organizations are analyzing current and historical

More information

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION K.Vinodkumar 1, Kathiresan.V 2, Divya.K 3 1 MPhil scholar, RVS College of Arts and Science, Coimbatore, India. 2 HOD, Dr.SNS

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Lightweight Data Integration using the WebComposition Data Grid Service

Lightweight Data Integration using the WebComposition Data Grid Service Lightweight Data Integration using the WebComposition Data Grid Service Ralph Sommermeier 1, Andreas Heil 2, Martin Gaedke 1 1 Chemnitz University of Technology, Faculty of Computer Science, Distributed

More information

Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Database Application Developer Tools Using Static Analysis and Dynamic Profiling Database Application Developer Tools Using Static Analysis and Dynamic Profiling Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala Microsoft Research {surajitc,viveknar,manojsy}@microsoft.com Abstract

More information

CUSTOMER Presentation of SAP Predictive Analytics

CUSTOMER Presentation of SAP Predictive Analytics SAP Predictive Analytics 2.0 2015-02-09 CUSTOMER Presentation of SAP Predictive Analytics Content 1 SAP Predictive Analytics Overview....3 2 Deployment Configurations....4 3 SAP Predictive Analytics Desktop

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

Fast and Easy Delivery of Data Mining Insights to Reporting Systems

Fast and Easy Delivery of Data Mining Insights to Reporting Systems Fast and Easy Delivery of Data Mining Insights to Reporting Systems Ruben Pulido, Christoph Sieb rpulido@de.ibm.com, christoph.sieb@de.ibm.com Abstract: During the last decade data mining and predictive

More information

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design Physical Database Design Process Physical Database Design Process The last stage of the database design process. A process of mapping the logical database structure developed in previous stages into internal

More information

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Warehousing and OLAP Technology for Knowledge Discovery 542 Data Warehousing and OLAP Technology for Knowledge Discovery Aparajita Suman Abstract Since time immemorial, libraries have been generating services using the knowledge stored in various repositories

More information

DATABASE MANAGEMENT SYSTEM

DATABASE MANAGEMENT SYSTEM REVIEW ARTICLE DATABASE MANAGEMENT SYSTEM Sweta Singh Assistant Professor, Faculty of Management Studies, BHU, Varanasi, India E-mail: sweta.v.singh27@gmail.com ABSTRACT Today, more than at any previous

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Dimensional Modeling for Data Warehouse

Dimensional Modeling for Data Warehouse Modeling for Data Warehouse Umashanker Sharma, Anjana Gosain GGS, Indraprastha University, Delhi Abstract Many surveys indicate that a significant percentage of DWs fail to meet business objectives or

More information

Database Systems. Lecture 1: Introduction

Database Systems. Lecture 1: Introduction Database Systems Lecture 1: Introduction General Information Professor: Leonid Libkin Contact: libkin@ed.ac.uk Lectures: Tuesday, 11:10am 1 pm, AT LT4 Website: http://homepages.inf.ed.ac.uk/libkin/teach/dbs09/index.html

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing CSE 544 Principles of Database Management Systems Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing Class Projects Class projects are going very well! Project presentations: 15 minutes On Wednesday

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Data Warehouse: Introduction

Data Warehouse: Introduction Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of Base and Mining Group of base and data mining group,

More information

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle Outlines Business Intelligence Lecture 15 Why integrate BI into your smart client application? Integrating Mining into your application Integrating into your application What Is Business Intelligence?

More information

Databases in Organizations

Databases in Organizations The following is an excerpt from a draft chapter of a new enterprise architecture text book that is currently under development entitled Enterprise Architecture: Principles and Practice by Brian Cameron

More information

Mining Association Rules: A Database Perspective

Mining Association Rules: A Database Perspective IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 69 Mining Association Rules: A Database Perspective Dr. Abdallah Alashqur Faculty of Information Technology

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE

NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE www.arpapress.com/volumes/vol13issue3/ijrras_13_3_18.pdf NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE Hebah H. O. Nasereddin Middle East University, P.O. Box: 144378, Code 11814, Amman-Jordan

More information

University of Gaziantep, Department of Business Administration

University of Gaziantep, Department of Business Administration University of Gaziantep, Department of Business Administration The extensive use of information technology enables organizations to collect huge amounts of data about almost every aspect of their businesses.

More information

From Databases to Natural Language: The Unusual Direction

From Databases to Natural Language: The Unusual Direction From Databases to Natural Language: The Unusual Direction Yannis Ioannidis Dept. of Informatics & Telecommunications, MaDgIK Lab University of Athens, Hellas (Greece) yannis@di.uoa.gr http://www.di.uoa.gr/

More information

Semantic Analysis of Business Process Executions

Semantic Analysis of Business Process Executions Semantic Analysis of Business Process Executions Fabio Casati, Ming-Chien Shan Software Technology Laboratory HP Laboratories Palo Alto HPL-2001-328 December 17 th, 2001* E-mail: [casati, shan] @hpl.hp.com

More information

In-Database Analytics

In-Database Analytics Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing

More information

Data Mining is the process of knowledge discovery involving finding

Data Mining is the process of knowledge discovery involving finding using analytic services data mining framework for classification predicting the enrollment of students at a university a case study Data Mining is the process of knowledge discovery involving finding hidden

More information

MINING CLICKSTREAM-BASED DATA CUBES

MINING CLICKSTREAM-BASED DATA CUBES MINING CLICKSTREAM-BASED DATA CUBES Ronnie Alves and Orlando Belo Departament of Informatics,School of Engineering, University of Minho Campus de Gualtar, 4710-057 Braga, Portugal Email: {alvesrco,obelo}@di.uminho.pt

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes Final Exam Overview Open books and open notes No laptops and no other mobile devices

More information

ETL-EXTRACT, TRANSFORM & LOAD TESTING
