Nagarjuna College Of Information Technology (Bachelor in Information Management) TRIBHUVAN UNIVERSITY Project Report on World's successful data mining and data warehousing projects Submitted By: Submitted To: Submission Date:
Data Mining
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. It is becoming an increasingly important tool for transforming data into information, and it is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

Data Warehouse
A data warehouse is a repository of an organization's electronically stored data, designed to facilitate reporting and analysis. The data warehouse itself focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system, and many references to data warehousing use this broader context. Thus, an expanded definition of data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata. Data warehousing arises from an organization's need for reliable, consolidated, unique and integrated analysis and reporting of its data at different levels of aggregation.

Data Mining Challenges
To be successful, data mining requires the right team, the right methodology, the right architecture, and the right technology.
1. The Right Team
Data mining projects must be a collaborative effort driven by business experts, developed by analytic modelers and supported by IT. Internal skill sets may be developed over time, which may mean initially hiring data mining consultants to build the data mining capability, with the ultimate objective of transferring knowledge to the team. To ensure a successful data mining outcome, the team needs the following three classes of experts: business domain experts, information technology support, and analytic modelers/data miners.

2. The Right Methodology
Data mining is an ongoing process that must be maintained and changed as business drivers change. The key to a successful project is to base it on a proven methodology. Such a methodology has delivered successful models that have uncovered millions of dollars in revenue and cost savings for customers.

3. The Right Architecture
There are several data mining architectures commonly used today. They include the distributed independent data mart, the data warehouse with dependent data marts, and the centralized data warehouse and mining architecture. In data mining, the primary architectures are the process architecture and the system architecture. Both should be clearly defined.

4. The Right Technology
The right technology begins with the right foundation: the database. Effective data mining depends on a comprehensive and robust data warehouse, not a summarized data mart, because it is difficult to predict the specific attributes that will contribute to a data mining model. Some companies are trying to do data warehousing with a database that was designed for OLTP, that is, operational processing of high-speed transactions.
The operations performed in databases optimized for OLTP (adding, deleting, and modifying records, and other row-level update functions) are quite different from those that are necessary to analyze large volumes of historical data, and therefore require different database capabilities.
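The contrast between row-level OLTP operations and analysis of historical data can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely for convenience; the `sales` table, its columns and all values are hypothetical and do not come from any project described in this report.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
rows = [
    (1, "north", 2000, 120.0),
    (2, "north", 2001, 150.0),
    (3, "south", 2000, 80.0),
    (4, "south", 2001, 95.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# OLTP-style work: each statement touches a single row.
conn.execute("UPDATE sales SET amount = 100.0 WHERE id = 4")
conn.execute("DELETE FROM sales WHERE id = 3")
conn.execute("INSERT INTO sales VALUES (5, 'south', 2002, 110.0)")

# Analytic-style work: scan the whole history and aggregate it.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The first group of statements needs fast access to individual records, while the last query reads every row. On large historical data sets, that difference is why a warehouse is organized differently from an operational database.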
Data mining Project in Baosteel
Here is one successful data mining project, that of Baoshan Iron and Steel Co. Ltd. (Baosteel), which shows what the company did to make the project succeed.

1. Introduction
Many problems in the operation of the metallurgical industry still need to be solved, such as integrated quality control and supply chain management. Because these problems are multivariable and nonlinear, it is difficult to reach the optimum at enterprise level with traditional local optimization methods. At Baosteel, the data distributed across all parts of the plants are organized into a data warehouse. Data mining is carried out on this basis, and the knowledge acquired from the data is applied to practical control and management systems, so that work is done better than before.

2. Data mining methodology
The data mining methodology can be regarded as the meta-knowledge of data mining, which shows the direction from data to knowledge. In general, the workflow of data mining can be divided into three steps: data preparation, data mining (in the narrow sense), and result interpretation, as shown in Figure 1. First, data preparation provides data mining with appropriate data. Next, data mining uses a set of algorithms to extract patterns or models from the data. Finally, field experts give explanations that convert the patterns or models into knowledge and guide daily work.

Figure 1: The general workflow of data mining

For the metallurgical industry process field, a data mining methodology named SEMMAO is adopted, as shown in Figure 2. It is divided into 6 steps: sampling (S), exploring (E), modifying (M), modeling (M), assessing (A) and optimizing (O), and it is an approach to extracting knowledge from data step by step. The SEMMAO methodology is derived from data mining practice in Baosteel and has proved effective.
Figure 2: SEMMAO methodology

The data source for data mining is the data warehouse (at enterprise level) or the data mart (at business division or department level). Data mining should be based on a data warehouse rather than a traditional database management system (DBMS) because the two have different orientations. More specifically, a DBMS is usually used to create operational databases and on-line transaction processing (OLTP) systems. In contrast, statistical analysis, data mining and on-line analytical processing (OLAP) require a non-standardized data structure. The data warehouse is thus born from the reorganization of the database.

The sampling step selects samples from a large sample set according to a specific rule. It can be random or nonrandom sampling. The goal of sampling is to reduce the amount of data for the next steps and to improve the distribution of the data.

The exploring step performs visual exploration of the data. It helps the analyst become acquainted with the distribution of the data and provides useful hints for the following steps.

The modifying step adjusts unsatisfactory data to meet the requirements of the modeling algorithms. There are many modifying methods, such as missing data processing, outlier processing, contradiction processing, data standardization, variable transformation, and so on.

The modeling step extracts knowledge from the data with a mathematical model. All models can be divided into two categories: supervised models and unsupervised models. In supervised mode, the target variables have given values. In unsupervised
mode, the target variables are absent, and accordingly the data samples are divided into several clusters using only the information in the input variables; the clusters can also be used for classification.

The assessing step reports the results of modeling, the error analysis and the assessment of the models. Once they have been proved acceptable, the models can be considered a form of knowledge and used later for forecasting and optimizing.

The optimizing step uses the acquired knowledge to solve practical problems. It answers questions such as "how to set the values of input variables to meet the goals of target variables". After the foregoing steps, the knowledge derived from practical data is applied in the production process, which in turn produces new data. This forms a cycle that continuously improves production capability.

3. Data mining software tools
There are many commercial data mining software packages. Two data mining software tools are introduced in the paper. One is Practical Miner (PM for short); the other is SAS Enterprise Miner (SAS/EM for short). Both have proved useful in practical applications in the company.

Practical Miner is a simple and practical data mining software tool, just like an automatic camera, which completes all work with just one push. It was developed by a group at the Baosteel Research Institute according to the SEMMAO methodology. PM is based on the basic SAS platform. SAS was selected as the development and runtime environment because the developers consider it the best statistical software and it is popular in various applications. PM is powerful, covering the whole data mining process from data preprocessing to data presentation. Moreover, PM offers a user-friendly interface, and with its Chinese help system, users can handle it easily whether or not they are familiar with data mining technology.

For data mining professionals, however, the company chose SAS/EM. The latest delivered version of SAS/EM was 4.2. It adopts object-oriented visual programming technology and contains most data mining algorithms. As a powerful data mining tool, SAS/EM places stricter requirements on users, who need extensive statistical knowledge.
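The six SEMMAO steps described in section 2 can be sketched in a few lines of code. This is only an illustrative sketch under stated assumptions and not how Practical Miner or SAS/EM work internally. The records, the linear relationship, the choice to drop incomplete rows, and the target value are all hypothetical.

```python
import random
import statistics

# Hypothetical process records: (input setting x, measured quality y).
# The invented data follow y = 2x + 1; None marks a missing measurement.
records = [(float(x), 2.0 * x + 1.0) for x in range(1, 21)]
records[4] = (5.0, None)

# S: sampling. Draw a reproducible random subset of the records.
rng = random.Random(0)
sample = rng.sample(records, 10)

# E: exploring. Summarize the distribution of the sampled inputs.
xs = [x for x, _ in sample]
summary = {"min": min(xs), "max": max(xs), "mean": statistics.mean(xs)}

# M: modifying. Drop rows with a missing target (one of several possible treatments).
clean = [(x, y) for x, y in sample if y is not None]

# M: modeling. Supervised model: least-squares line y = a*x + b.
def fit_line(points):
    mean_x = statistics.mean([x for x, _ in points])
    mean_y = statistics.mean([y for _, y in points])
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

a, b = fit_line(clean)

# A: assessing. Mean absolute error over all complete records.
complete = [(x, y) for x, y in records if y is not None]
mae = statistics.mean([abs(a * x + b - y) for x, y in complete])

# O: optimizing. Pick the input setting whose prediction is closest to a target.
target = 25.0  # assumed quality goal
best_x = min((x for x, _ in records), key=lambda x: abs(a * x + b - target))

print(summary, round(a, 3), round(b, 3), round(mae, 6), best_x)
```

In practice each step would be far richer (visual exploration, outlier handling, nonlinear models), but the flow from sampled data to a setting recommended by the model is the same cycle.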
4. Some applications in Baosteel
Baosteel has accumulated a large amount of production data since it launched production in 1985. As a leader in the steel industry, Baosteel has carried out research and application work on data warehousing and data mining, keeping pace with the latest international developments. Through several years of effort, an enterprise data warehouse has been constructed, and a series of data mining research and application projects have been based on it.

The most widespread data mining applications in Baosteel focus on quality control. The first data mining case was a project on ship plate quality analysis, in which some key variables were found that improve product quality. It helped the ship plate obtain certification from international ship classification organizations such as LR, BV, RINA, and DnV. After that, the Baosteel Manufacturing Management Department applied data mining to the quality control of the hot rolling mill and the cold rolling mill, with profit exceeding 30 million RMB in 2001. Baosteel was awarded the top National Quality Control Award in 2001.

There are also other successful data mining cases in manufacturing management. The most profitable data mining project in Baosteel is the optimization of iron ore mixing. The proportion of different iron ores was optimized, reducing production cost while assuring quality and bringing Baosteel an annual profit of 60 million RMB. Data mining was also applied to the analysis of the rolling plan, aiming to improve the hit rate of contracts. In addition, some work was done to optimize the inventory structure in order to cut inventory cost and balance resources.

Data mining is applied to production process control too. For example, in the hot rolling process, a rolling stress prediction model was built by data mining.

Furthermore, data mining has had an effect on enterprise marketing and sales management. On the one hand, Baosteel implements weekly shipment for some important customers based on data mining of the shipment period, speeding up supply chain response and improving customer service quality. On the other hand, a customer-oriented supply chain management application is under construction, whose benchmark values will be extracted from the data warehouse by data mining.
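The iron ore mixing problem mentioned above has the general shape "choose proportions that minimize cost while meeting a quality constraint". The report does not describe the method Baosteel actually used, so the following is only a minimal sketch of that problem shape, solved by brute-force search over proportions. All ore names, costs, iron contents and the quality threshold are invented for illustration. A real application would more likely use linear programming or a model learned from the data.

```python
from itertools import product

# Hypothetical ores: name -> (cost per ton, iron content in percent).
# These numbers are invented and are not Baosteel data.
ores = {"A": (62.0, 65.0), "B": (45.0, 58.0), "C": (30.0, 50.0)}
MIN_IRON = 60.0  # assumed quality constraint on the blend
STEP = 10        # search proportions in steps of 10 percent

def best_blend(ores, min_iron, step):
    """Return (cost, iron, shares) of the cheapest blend meeting min_iron."""
    names = list(ores)
    best = None
    for shares in product(range(0, 101, step), repeat=len(names)):
        if sum(shares) != 100:
            continue  # proportions must add up to 100 percent
        cost = sum(s * ores[n][0] for s, n in zip(shares, names)) / 100
        iron = sum(s * ores[n][1] for s, n in zip(shares, names)) / 100
        if iron >= min_iron and (best is None or cost < best[0]):
            best = (cost, iron, dict(zip(names, shares)))
    return best

cost, iron, shares = best_blend(ores, MIN_IRON, STEP)
print(shares, cost, iron)
```

The search is exhaustive, so it is only practical for a handful of ores and a coarse step, but it makes the trade-off between cost and quality explicit.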
5. Conclusion
The authors discuss the data mining methodology and software tools used in the manufacturing management of the metallurgical industry, and introduce some practical applications in Baosteel. As participants in the field for years, they share their experience as follows:
a. Data mining can in fact bring profits to a conventional industrial enterprise. By acquiring hidden knowledge through data mining, an enterprise can raise its level of informatization and convert potential productivity into realized productivity.
b. Data mining is driven by application. The selection of methodology and software tools must serve the solution of practical problems. Application projects succeed on the basis of seamless cooperation between data mining professionals and end users.
c. The knowledge discovered by data mining must be applied to problems in the real world. That is the ultimate goal of informatization.

6. Other successful data mining projects
Texas A&M University, College Station, TX used data mining techniques to investigate the success of Open Source Software (OSS) projects. In this project the researchers wanted to determine the best approach to model formulation, validation and testing. They used the predictive modeling techniques of Logistic Regression (LR), Decision Trees (DT) and Neural Networks (NN) together for their analysis. The findings from this analysis were then used for model formulation, validation, and testing, with more success than in their previous research projects.

According to the preliminary findings of this research, projects that were created before the year 2003 were less likely to succeed than the more recent projects that use data mining techniques. One of the reasons may be that the OSS movement is becoming more popular and the newer projects offer more promise to developers and users than the older projects. This would also imply that, with time, OSS teams are improving their project management process.
Another important finding is that the number of downloads is positively related to success: projects that have more downloads are more likely to succeed. The number of bugs reported also has a positive relationship to success, because a higher number of reported bugs implies that the software is being used. The number of open bugs is an indicator of the inability of the project team to fix bugs; therefore it has a negative impact on success. Team size has a positive impact on success, so the bigger the team, the higher the probability that the project succeeds. OSS projects also have the option of using a project manager or not. Use of project management methods has a positive impact on the success of the project.
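To make the direction of these relationships concrete, the sketch below shows how a logistic regression model, one of the three techniques named above, turns project attributes into a probability of success. The coefficients, the intercept, the feature names and the example project are hypothetical. They were chosen only to match the signs reported in the findings and are not estimates from the Texas A&M study.

```python
import math

# Hypothetical coefficients that only mirror the reported signs:
# positive for downloads, bugs reported, team size and project management,
# negative for open bugs. They are NOT values from the study.
WEIGHTS = {
    "downloads_thousands": 0.3,
    "bugs_reported": 0.02,
    "bugs_open": -0.05,
    "team_size": 0.1,
    "uses_project_management": 0.5,
}
INTERCEPT = -4.0  # hypothetical

def success_probability(project):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum of w_i * x_i)))."""
    z = INTERCEPT + sum(WEIGHTS[name] * project[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

# An invented example project and two variations of it.
base = {
    "downloads_thousands": 5,
    "bugs_reported": 40,
    "bugs_open": 10,
    "team_size": 4,
    "uses_project_management": 0,
}
more_open_bugs = dict(base, bugs_open=30)
with_pm = dict(base, uses_project_management=1)

for label, project in [("base", base), ("more open bugs", more_open_bugs), ("with PM", with_pm)]:
    print(label, round(success_probability(project), 3))
```

With these assumed weights, adding project management raises the predicted probability and leaving more bugs open lowers it, which is the qualitative pattern the study reports.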