CHAPTER 5 APPLICATION OF DATA MINING TECHNIQUES IN PHARMACEUTICAL FORMULATION

5.1 INTRODUCTION

After 1990, it became apparent that a better data-driven method was needed to identify projects and generate charters with a greater potential for success. Data Mining and Modeling (DMM), the methodology of finding relationships between inputs and outputs (modeling) and converting this exploratory model into value, was identified as a viable approach for accomplishing this, and a team was formed to bring knowledge about the DMM methodology into the company. This mindset has provided fertile ground for making the case for data mining and facilitating its acceptance as a natural extension. Upper management in particular was much more open to giving data mining a try and providing adequate resources to launch it at a significant level.

5.2 UNIQUE DATA MINING NEEDS OF A GLOBAL CHEMICAL COMPANY

Although the company shares some of the same issues and concerns as other companies when it comes to data mining, it also differs in many ways from the companies where data mining has traditionally been applied, such as insurance, banking, credit card and other financial institutions and large retail or on-line stores, where millions of transactions may take place daily. It is probably fair to say that

much attention has been devoted by vendors to providing tools and methods that address these types of problems. Although other industries can benefit from such developments, there are still unique requirements driving the need for a different sort of data mining approach. In contrast to huge terabyte data sets, for example, we generally deal with smaller gigabyte-size sets of greater variety, such as manufacturing process data, research and development data for new material and product development, marketing and business data, and supply-chain data of a globally distributed company dealing with multinational regulatory issues. The company is essentially a collection of businesses, each with different needs and objectives, yet they share many of the benefits that come from large-scale implementation and integration, especially at the geographic site level, through pooled resources and services. The end result is that any widely deployed data mining methodology and tools must be general and flexible enough to accommodate diverse needs, in order to solve local problems effectively while avoiding sub-optimization.

5.3 OVERVIEW OF DATA MINING

The development of information technology has generated a large number of databases and huge volumes of data in various areas. Research in databases and information technology has given rise to approaches for storing and manipulating this precious data for further decision making. Data mining is the process of extracting useful information and patterns from huge data. It is also called the knowledge discovery process, knowledge mining from data, knowledge extraction or data/pattern analysis. Generating information requires a massive collection of data. The data can range from simple numerical figures and text documents to more complex information such as spatial data, multimedia data and hypertext documents.

To take full advantage of data, retrieval alone is not enough; what is required is a tool for automatic

summarization of data, extraction of the essence of the information stored, and discovery of patterns in raw data. With the enormous amount of data stored in files, databases and other repositories, it is increasingly important to develop powerful tools for the analysis and interpretation of such data and for the extraction of interesting knowledge that could help in decision making. Data mining is the answer to these needs.

Data mining is the extraction of hidden predictive information from large databases; it is a powerful technology with great potential to help organizations focus on the most important information in their data warehouses (Fayyad 1996). Data mining tools predict future trends and behaviors and help organizations make proactive, knowledge-driven decisions (Fayyad 1996). The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer questions that traditionally were too time consuming to resolve. They search databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Data mining, popularly known as Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases (Dunham 2006). Although data mining and knowledge discovery in databases are frequently treated as synonyms, data mining is actually one part of the knowledge discovery process (Fayyad 1996).

Data mining is a logical process used to search through large amounts of data in order to find useful information. The goal of this technique is to find patterns that were previously unknown. Once these patterns are found, organizations can use them to make decisions for the development of their

businesses. Three steps are involved: exploration, pattern identification and deployment.

Exploration: In the first step, the data is cleaned and transformed into another form, and the important variables and the nature of the data are determined based on the problem.

Pattern Identification: Once the data has been explored, refined and defined for the specific variables, the second step is pattern identification: identify and choose the patterns that make the best prediction.

Deployment: The patterns are deployed for the desired outcome.

5.4 KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Traditional methods use manual analysis to find patterns or extract knowledge from a database. In health care, for example, health organizations analyze trends in diseases and their occurrence rates, which helps them take precautions in future decision making and in planning health care management. In any field, such as banking, mechanics, health care or marketing, there is always a data analyst who works with the data and analyzes the final results. The analyst acts as an interface between the data and the knowledge. Machine intelligence can assist the analyst in producing similar results or knowledge from the data. When patterns are found within a database, the activity of stating the findings (patterns or rules) is variously called data mining, information retrieval, knowledge extraction and so on. The term data mining is used mostly by statisticians, data analysts and the management information systems (MIS) community. The difference

between data mining and knowledge discovery is that data mining is the application of different intelligent algorithms to extract patterns from the data, whereas knowledge discovery is the overall process involved in discovering knowledge from data. Other steps, such as data preprocessing, data selection, data cleaning and data visualization, are also part of the KDD process.

Figure 5.1 Knowledge Data Discovery Process

The Knowledge Discovery Process

Knowledge discovery is the process of automatically generating information formalized in a form understandable to humans (Wuthrich 1995). Data mining is one of the tasks in the process of knowledge discovery from databases. The KDD process contains the following steps (Frawley 1991):

Data cleaning: Also known as data cleansing; in this phase noisy and irrelevant data are removed from the collection.

Data integration: In this stage, multiple data sources, often heterogeneous, are combined into a common source.

Data selection: The data relevant to the analysis is decided on and retrieved from the data collection.

Data transformation: Also known as data consolidation; in this phase the selected data is transformed into forms appropriate for the mining procedure.

Data mining: This is the crucial step, in which clever techniques are applied to extract potentially useful patterns.

Pattern evaluation: In this step, interesting patterns representing knowledge are identified based on given measures.

Knowledge representation: This is the final phase, in which the discovered knowledge is visually presented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

5.5 THE DATA MINING TASKS

Data mining tasks are of different types. Depending on how the data mining result is used, they are classified as follows.

Exploratory Data Analysis: This is simply exploring the data without any clear idea of what we are looking for. These techniques are interactive and visual.

Descriptive Modeling: This describes all of the data. It includes models for the overall probability distribution of the data,

partitioning of the p-dimensional space into groups, and models describing the relationships between the variables.

Predictive Modeling: This kind of model permits the value of one variable to be predicted from the known values of other variables.

Discovering Patterns and Rules: This is concerned with pattern detection. One aim is spotting fraudulent behavior by detecting regions of the space, defining the different types of transactions, where the data points are significantly different from the rest.

Retrieval by Content: This is finding patterns similar to a pattern of interest in the data set. This task is most commonly used for text and image data sets.

5.6 TYPES OF DATA MINING SYSTEM

Data mining systems can be categorized according to various criteria, as follows.

Classification of data mining systems according to the type of data source mined: An organization holds a huge amount of data, most of which is available in a similar fashion. These data need to be classified according to their type (for example audio/video or text format).

Classification of data mining systems according to the data model: There are many data mining

models available (relational data model, object model, object-oriented data model, hierarchical data model/W data model), and each model works with different data. The data mining system classifies the data according to the data model involved.

Classification of data mining systems according to the kind of knowledge discovered: This classification is based on the kind of knowledge discovered or on data mining functionalities, such as characterization, discrimination, association, classification and clustering. Some systems tend to be comprehensive, offering several data mining functionalities together.

Classification of data mining systems according to the mining techniques used: This classification follows the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, and database-oriented or data warehouse-oriented techniques. The classification can also take into account the degree of user interaction involved in the data mining process, as in query-driven systems, interactive exploratory systems or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

5.7 THE DATA MINING LIFE CYCLE

The life cycle of a data mining project consists of six phases. The sequence of the phases is not rigid. Moving back and forth between different

phases is always required, depending on the outcome of each phase. The main phases are:

Business Understanding: This phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Data Understanding: This phase starts with an initial data collection and proceeds with activities to get familiar with the data, identify data quality problems, discover first insights into the data, or detect interesting subsets from which to form hypotheses about hidden information.

Data Preparation: This phase covers all the activities needed to construct the final data set from the initial raw data, bringing together the different data sets involved.

Modeling: In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.

Evaluation: In this phase the model is thoroughly evaluated, and the steps executed to construct it are reviewed, to be certain that it properly achieves the business objectives. At the end of this phase, a decision on the use of the data mining results should be reached.

Deployment: Even when the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that the customer can use.

The deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

5.8 THE DATA MINING MODELS

The data mining models are of two types: predictive and descriptive.

The predictive model makes predictions about unknown data values by using known values. Examples: classification, regression, time series analysis, prediction.

The descriptive model identifies the patterns or relationships in data and explores the properties of the data examined. Examples: clustering, summarization, association rules, sequence discovery.

Many data mining applications aim to predict the future state of the data. Prediction is the process of analyzing the current and past states of an attribute and predicting its future state. Classification is a technique of mapping the target data to predefined groups or classes; it is supervised learning because the classes are defined before the target data is examined. Regression involves learning a function that maps a data item to a real-valued prediction variable. In time series analysis, the value of an attribute is examined as it varies over time. Distance measures are used to determine the similarity between different time series, the structure of the line is examined to determine its behavior, and the historical time series plot is used to predict future values of the variable.
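The two time series ideas above, a distance measure for similarity and a fitted function for prediction, can be illustrated with a minimal sketch. It assumes Euclidean distance and a straight-line least-squares fit, which are only two of many possible choices; the function names and the two sample series are hypothetical and not taken from the study.

```python
from math import sqrt


def euclidean_distance(a, b):
    """Distance between two equal-length time series (smaller means more similar)."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_line(ys):
    """Least-squares fit of y = intercept + slope * t for t = 0, 1, ..., n - 1.

    Needs at least two observations, otherwise the slope is undefined.
    """
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(ys, steps_ahead=1):
    """Extrapolate the fitted line beyond the last observed time step."""
    intercept, slope = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + steps_ahead)


# Hypothetical measurements of one attribute at four consecutive time steps.
series_a = [2.0, 4.0, 6.0, 8.0]
series_b = [2.0, 4.0, 6.0, 11.0]

similarity = euclidean_distance(series_a, series_b)
next_value = forecast(series_a)
print(similarity, next_value)
```

A straight line is the simplest regression function; real series with seasonality or nonlinear trends would call for a richer model.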

Clustering is similar to classification except that the groups are not predefined but are defined by the data alone. It is also referred to as unsupervised learning or segmentation. It is the partitioning or segmentation of the data into groups or clusters. The clusters are characterized by domain experts who study the behavior of the data. The term segmentation is used in a very specific context: it is the process of partitioning a database into disjoint groupings of similar tuples.

Summarization is the technique of presenting summarized information from the data.

Association rules find the associations between different attributes. Association rule mining is a two-step process: finding all frequent item sets, then generating strong association rules from the frequent item sets.

Sequence discovery is the process of finding sequence patterns in data. These sequences can be used to understand trends.

5.9 DATA MINING METHODS

The data mining methods are broadly categorised as listed below. These methods use different types of algorithms and data. The data source can be a data warehouse, database, flat file or text file. The algorithms may be statistical, decision tree based, nearest neighbor, neural network based, genetic algorithm based, rule based, support vector machines and so on. The selection of a data mining algorithm depends mainly on the type of data used for mining and the expected outcome of the mining process. Domain experts play a significant role in the selection of the algorithm.

Classification
Clustering
Association Rule Mining
Temporal Data Mining

Time Series Analysis
Spatial Mining
Web Mining

A knowledge discovery (KD) process involves preprocessing data, choosing a data mining algorithm and post-processing the mining results. There are many choices for each of these stages, and non-trivial interactions between them. Both novices and data mining specialists therefore need assistance with knowledge discovery processes. Intelligent Discovery Assistants (IDAs) help users apply valid knowledge discovery processes. An IDA can provide users with three benefits:

A systematic enumeration of valid knowledge discovery processes;
Effective rankings of valid processes by different criteria, which help in choosing between the options;
An infrastructure for sharing knowledge, which leads to network externalities.

Several other attempts have been made to automate this process and to design a generalized data mining tool that possesses the intelligence to select the data and the data mining algorithms and, to some extent, to carry out the knowledge discovery.

DATA MINING ALGORITHMS AND TECHNIQUES

Various algorithms and techniques, such as Classification, Clustering, Regression, Artificial Intelligence, Neural Networks, Association Rules,

Decision Trees, Genetic Algorithms and the Nearest Neighbor method, are used for knowledge discovery from databases.

Classification

Classification is the most commonly applied data mining technique. It employs a set of pre-classified examples to develop a model that can classify the population of records at large. Fraud detection and credit risk applications are particularly well suited to this type of analysis. This approach frequently employs decision tree or neural network-based classification algorithms.

The data classification process involves two steps, learning and classification. In learning, the training data are analyzed by a classification algorithm. In classification, test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be applied to new data tuples. For a fraud detection application, the pre-classified examples would include complete records of both fraudulent and valid activities, determined on a record-by-record basis. The classifier-training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination. The algorithm then encodes these parameters into a model called a classifier.

Types of classification models

Classification by decision tree induction
Bayesian Classification
Neural Networks
Support Vector Machines (SVM)
Classification Based on Associations
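The learning and classification steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a 1-nearest-neighbor classifier rather than any of the model types listed, and the two-feature records, the labels and the 0.9 acceptance threshold are all hypothetical.

```python
def nearest_neighbor_predict(train, point):
    """Return the label of the training record closest to `point`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda record: squared_distance(record[0], point))
    return label


def accuracy(train, test):
    """Fraction of test records whose predicted label matches the known label."""
    correct = sum(
        1 for features, label in test
        if nearest_neighbor_predict(train, features) == label
    )
    return correct / len(test)


# Learning step: pre-classified examples (hypothetical two-feature records).
train = [
    ((1.0, 1.0), "valid"),
    ((1.5, 2.0), "valid"),
    ((8.0, 9.0), "fraud"),
    ((9.0, 8.5), "fraud"),
]

# Classification step: held-out test data estimate the accuracy.
test = [
    ((1.2, 1.1), "valid"),
    ((8.5, 9.2), "fraud"),
    ((2.0, 1.5), "valid"),
]

acc = accuracy(train, test)

# Only if the accuracy is acceptable is the model applied to new tuples.
ACCEPTABLE_ACCURACY = 0.9  # hypothetical threshold
new_record = (7.5, 8.0)
if acc >= ACCEPTABLE_ACCURACY:
    print(nearest_neighbor_predict(train, new_record))
```

Keeping the test records separate from the training records is the point of the two-step process: accuracy measured on the training data alone would overstate how well the classifier generalizes.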

Clustering

Clustering can be described as the identification of similar classes of objects. By using clustering techniques we can further identify dense and sparse regions in object space and can discover the overall distribution pattern and correlations among data attributes. The classification approach can also be an effective means of distinguishing groups or classes of objects, but it becomes costly, so clustering can be used as a preprocessing approach for attribute subset selection and classification. Examples include forming groups of customers based on purchasing patterns and categorizing genes with similar functionality.

Types of clustering methods

Partitioning methods
Hierarchical methods (agglomerative or divisive)
Density-based methods
Grid-based methods
Model-based methods

Prediction

Regression techniques can be adapted for prediction. Regression analysis can be used to model the relationship between one or more independent variables and dependent variables. In data mining, independent variables are attributes already known and response variables are what we want to predict. Unfortunately, many real-world problems are not simply prediction. For instance, sales volumes, stock prices and product failure rates are all very difficult to predict because they may depend on complex interactions of multiple predictor variables. Therefore, more complex

techniques (e.g., logistic regression, decision trees or neural nets) may be necessary to forecast future values. The same model types can often be used for both regression and classification. For example, the CART (Classification and Regression Trees) decision tree algorithm can be used to build both classification trees (to classify categorical response variables) and regression trees (to forecast continuous response variables). Neural networks, too, can create both classification and regression models.

Types of regression methods

Linear Regression
Multivariate Linear Regression
Nonlinear Regression
Multivariate Nonlinear Regression

Association Rule

Association and correlation analysis is usually used to find frequent item sets among large data sets. This type of finding helps businesses make certain decisions, such as catalogue design, cross marketing and customer shopping behavior analysis. Association rule algorithms need to be able to generate rules with confidence values less than one. However, the number of possible association rules for a given data set is generally very large, and a high proportion of the rules are usually of little (if any) value.

Types of association rule

Multilevel association rule
Multidimensional association rule
Quantitative association rule
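The notions of frequent item sets and rule confidence can be made concrete with a small sketch. It assumes the usual definitions (support is the fraction of transactions containing an item set; confidence of A -> B is support(A and B) divided by support(A)) and enumerates candidates by brute force, which is only practical for tiny examples. The transactions and both thresholds are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: collect all item sets whose support reaches `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
                found = True
        if not found:
            # No frequent set of this size, so no larger one can be frequent.
            break
    return result


def strong_rules(freq, min_confidence):
    """Step 2: derive rules antecedent -> consequent with enough confidence."""
    rules = []
    for itemset, s in freq.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent item set is itself frequent.
                confidence = s / freq[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical shopping baskets.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

freq = frequent_itemsets(transactions, min_support=0.5)
rules = strong_rules(freq, min_confidence=0.8)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

Raising either threshold prunes more of the low-value rules, at the risk of discarding rare but interesting ones.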

Neural Networks

A neural network is a set of connected input/output units in which each connection has an associated weight. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class labels of the input tuples. Neural networks have a remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. They are well suited to continuous-valued inputs and outputs. Examples include handwritten character recognition and training a computer to pronounce English text; neural networks also suit many real-world business problems and have already been successfully applied in many industries. Neural networks are best at identifying patterns or trends in data and are well suited to prediction or forecasting needs.

Types of neural networks

Back Propagation

5.11 DATA MINING APPLICATIONS

Data mining applications can be generic or domain specific. A generic application is required to be an intelligent system that can make certain decisions on its own, such as the selection of data, the selection of the data mining method, and the presentation and interpretation of the result. Some generic data mining applications cannot make these decisions on their own but guide users in the selection of data, the selection of the data mining method and the interpretation of the results. A multi-agent system tool used for generic data mining system development uses different agents to perform different tasks. A multi-tier data mining system is proposed to enhance the performance of the data mining

process. It has basic components such as the user interface, data mining services, data access services and the data. Three different architectures are presented for the data mining system, namely:

One-tier architecture
Two-tier architecture
Three-tier architecture

The pharmaceutical industry is well known for performing quantitative analysis for clinical research and market research. In marketing departments, data mining applications are used for sales force planning and direct marketing to doctors and consumers. Data mining techniques apply quite well to a variety of critical business decisions in the pharmaceutical industry. They are also used for forecasting production schedules for manufacturing plants, determining market potential in critical go/no-go decisions on continuing work on development compounds, and making financial projections for stockholders and investors. This section focuses on some of the applications of data mining and analyzes the techniques used in each in turn.

Data Mining Applications in Healthcare

Data mining applications in health can have tremendous potential and usefulness. However, the success of healthcare data mining hinges on the availability of clean healthcare data. In this respect, it is critical that the healthcare industry look into how data can be better captured, stored, prepared and mined. Possible directions include the standardization of clinical

vocabulary and the sharing of data across organizations to enhance the benefits of healthcare data mining applications.

Future Directions of Health Care Systems through Data Mining Tools

As healthcare data are not limited to quantitative data but also include text (e.g., doctors' notes or clinical records), it is necessary to explore the use of text mining to expand the scope and nature of what healthcare data mining can currently do. Text mining is especially useful for combining all of these data and then mining the text. It is also useful to look into how images (e.g., MRI scans) can be brought into healthcare data mining applications. Progress has been made in these areas.

Data mining is used for market basket analysis

Data mining techniques are used in Market Basket Analysis (MBA). When customers buy products, this technique helps find the associations between the different items that they put in their shopping baskets. The discovery of such associations supports business strategy: retailers use data mining to identify customers' intentions (their different buying patterns). In this way the technique increases business profits and also helps customers purchase related items.

Data mining as an emerging trend in education systems worldwide

In India, many parents are uneducated, and the main aim of the Indian government is quality rather than quantity in education. Education systems are changing day by day, and in the 21st century a huge number of universities have been established by order of the UGC. As the numbers

of universities grow, a very large number of students enroll across the country every day. With such a huge number of higher education aspirants, we believe that data mining technology can help bridge the knowledge gap in higher educational systems. The hidden patterns, associations and anomalies that data mining techniques discover in educational data can improve decision-making processes in higher educational systems. This improvement can bring advantages such as maximizing educational system efficiency, decreasing the student drop-out rate, increasing the student promotion rate, retention rate and transition rate, increasing the educational improvement ratio, increasing student success and learning outcomes, and reducing the cost of system processes. In the current era, KDD and data mining tools are used to extract knowledge that can improve the quality of education. Decision tree classification is used in this type of application.

Data mining is now used in many different areas in manufacturing engineering

Data retrieved from a manufacturing system can be used for different purposes: to find errors in the data, to enhance the design methodology, to improve the quality of the data, and to determine how best the data can support decision making. Most of the time, however, the data is first analyzed to find the hidden patterns that control the manufacturing process, which in turn enhances the quality of the products. Since the importance of data mining in manufacturing has clearly increased over the last 20 years, it is now appropriate to critically review its history and applications.

Future Directions in Manufacturing Engineering through Data Mining Tools

Mining manufacturing data is a very tedious task. In general, when manufacturing data is mined, too little importance is given to the quality of the rules. The knowledge generated by mining is then very difficult to use, because the relationships identified are too complex to understand. Research methodology therefore needs to be enhanced further in order to arrive at the proper knowledge. One methodology that has been proposed is CRISP-DM, which provides high-level, detailed step-by-step instructions for using data mining in the engineering field. Further research is needed to develop generic guidelines for the variety of data and types of problems commonly faced by the manufacturing engineering industry.

Data Mining Applications can be generic or domain specific

A data mining system can be generic or domain specific. Some generic data mining applications cannot make decisions on their own but guide users in the selection of data, the selection of the data mining method and the interpretation of the results. A multi-agent based data mining application is capable of automatically selecting the data mining technique to be applied. The multi-agent system is used at different levels: first at the level of concept hierarchy definition, and then at the result level to present the best adapted decision to the user. This decision is stored in a knowledge base for use in later decision making.

A multi-tier data mining system is proposed to enhance the performance of the data mining process. It has basic components such as the user interface, data mining services, data access services and the data. There are three different architectures

presented for the data mining system, namely one-tier, two-tier and three-tier architecture.

A generic system is required to integrate as many learning algorithms as possible and to decide on the most appropriate algorithm to use. CORBA (Common Object Request Broker Architecture) has features that make it considerably easier to integrate different applications coded in any programming language. It allows reusability in a feasible way and makes it possible to build large and scalable systems. A data mining system architecture based on CORBA, which is specified by the Object Management Group, has all the characteristics needed to accomplish distributed and object-oriented computation.

A data-centric focus and automated methodologies make data mining accessible to non-experts. High-level interfaces can implement automated methodologies that hide the data mining concepts from the users. A data-centric design hides all the details of the mining methodology and exposes them through high-level, goal-oriented tasks. These goal-oriented tasks are implemented using data-centric APIs. This design makes a data mining task resemble the other types of queries that users perform on the data.

In data mining, better results can be obtained if large data is available. This leads to the merging and linking of local databases. A new data mining architecture based on Internet technology addresses this problem.

Context plays a vital role in the success of data mining. The importance and meaning of the same data differ from one context to another: data that is very important in one context may not be important in another. A context-aware data mining framework filters useful and interesting context factors and can produce accurate and precise predictions using those factors.

Application of Data Mining techniques in CRM

Data mining techniques are used in CRM. Nowadays this is one of the hot research topics in the industry, because CRM has attracted both the

practitioners and academics. This section aims to give a research summary on the application of data mining in the CRM domain and the techniques most often used. Although this review cannot claim to be exhaustive, it does provide reasonable insights and shows the incidence of research on this subject. The results presented have several important implications: research on the application of data mining in CRM will increase significantly in the future, based on past publication rates and the increasing interest in the area, and the majority of the reviewed articles relate to customer retention.

THE DOMAIN SPECIFIC APPLICATIONS

Domain specific applications focus on using domain specific data and data mining algorithms targeted at a specific objective. The applications studied in this context aim to generate specific knowledge. In different domains, the data generating sources produce different types of data, ranging from simple text and numbers to more complex audio and video data. To mine patterns, and thus knowledge, from this data, different types of data mining algorithms are used. Collecting and selecting context specific data and applying a data mining algorithm to generate context specific knowledge is thus a skillful job. In many domain specific data mining applications, domain experts play a vital role in mining useful knowledge. In the identification of foreign-accented French, audio files were used and the best 20 data mining algorithms were applied; the Logistic Regression model was found to be more robust than the other algorithms.

In language research and language engineering

A linguistic profile that contains a large number of linguistic features can be generated automatically from a text file using data mining. This

technique was found to be quite effective for authorship verification and recognition. A profiling system using a combination of lexical and syntactic features showed 97% accuracy in selecting the correct author of a text. Linguistic profiling of text has also been used effectively to control language quality and for automatic language verification, that is, to verify automatically whether a text is of native quality. The results show that language verification is indeed possible.

In Medical Science

There is large scope for applying data mining in medical science, for example in the diagnosis of disease, health care, patient profiling and history generation. Mammography is the method used in breast cancer detection, and computer-aided methods (CAM) can help radiologists detect tumors more easily. Neural networks with backpropagation and association rule mining have been used for tumor classification in mammograms. Data mining is also used effectively in the diagnosis of lung abnormalities (which may be cancerous or benign). Data mining algorithms significantly reduce patients' risks and diagnosis costs. Using the prediction algorithms, the observed prediction accuracy was 100% for 91.3% of cases. Medical data is complex and difficult to analyze, hence the use of data mining in health care is especially important. The REMIND (Reliable Extraction and Meaningful Inference from Non-structured Data) system integrates the structured and unstructured clinical data in patient records, which in turn automatically creates high-quality structured clinical data. To adopt this technique, existing patient records are to be mined to support guidelines and compliance, in order to improve patient care.

Data Mining Methods in Web Education

Data mining methods are used in web education to improve courseware. Relationships are discovered among the usage data collected during students' sessions. This knowledge is very useful for the teacher or the author of the course, who can decide which modifications would be most appropriate to improve the effectiveness of the course. Learners in the 21st century increasingly rely on data mining techniques, which makes it possible to increase their awareness. The application of data mining methods to educational chats in web education is expected to grow rapidly; it is feasible and can improve learning environments.

Credit Scoring

Credit scoring has become a very important issue owing to the recent growth of the credit industry. The credit department of a bank faces a huge volume of consumer credit data to process, and analyzing it manually is impossible in both economic and manpower terms. In this study we reviewed papers that have applied data mining methods to the credit risk evaluation problem. The ten data mining techniques most used in the credit risk evaluation context were extracted, and we then searched almost all papers that had focused on these ten methods from 2000 to [year missing]. It is concluded that the support vector machine has been widely applied in recent years and is one of the best techniques. Because improving the performance of this model requires a method for reducing the feature subset, many hybrid SVM-based models have been proposed. Moreover, hybrid models have received attention in the last decade because they combine the advantages of two or more models. Many of these proposed models can only classify customers into two classes, good or

bad ones. Several single and hybrid data mining methods have been applied to the credit scoring problem. The methods most used for credit scoring are derived from classification techniques. In general, classification is used when something is predicted on the basis of previously available information. It can be defined as assigning the members of a given set of instances to groups according to their characteristics. The classification task is well suited to data mining methods and techniques.

Intrusion Detection in the Network

Intrusion detection in a network is very difficult and requires a close watch on data traffic. Intrusion detection plays an essential role in computer security. The classification method of data mining is used to classify network traffic as normal or abnormal. If a TCP header does not belong to any of the existing TCP header clusters, it can be considered an anomaly.

A Malicious Executable Is a Threat

A malicious executable is a threat to a system's security: it damages the system or obtains sensitive information without the user's permission. Data mining methods are used to detect malicious executables accurately before they run. The classification algorithms RIPPER, Naïve Bayes and a multi-classifier system are used to detect new malicious executables. This classifier showed a detection rate of 97.76%.

Sports Data Mining

Data mining and its techniques are also applied in sports. Data mining is not only used for business purposes, but it is also

used in sports. Around the world a huge number of games take place, national and international games are scheduled every day, and a huge amount of data has to be maintained. Data mining tools are applied to provide information as and when it is required. Open-source data mining tools such as WEKA and RapidMiner are frequently used for sports data. This means that users can run their data through one of the built-in algorithms, see what results come out, and then run it through a different algorithm to see if anything different stands out. Because these programs are open source, users frequently modify the source code so that others can obtain the updated versions.

In the sports world, vast amounts of statistics are collected for each player, team, game and season. Because sports data is available in statistical form, data mining can be used to discover patterns, which are often used to forecast future results. Data mining can be used for scouting, prediction of performance, selection of players, coaching and training, and strategy planning. Data mining techniques are also used to determine the best or most suitable squad to represent a team in a team sport for a season, tour or game. The Cy Young Award is presented annually to the best pitcher in Major League Baseball. The award is based largely on statistics compiled over the course of the baseball season. A Bayesian classifier has been developed to predict Cy Young Award winners in American Major League Baseball.

The Intelligence Agencies

Intelligence agencies collect and analyze information to investigate terrorist activities. One challenge for law enforcement and intelligence agencies is the difficulty of analyzing the large volumes of data involved in criminal and terrorist activities. Nowadays intelligence agencies are

using sophisticated data mining algorithms, which make it easier to handle very large databases. Different data mining techniques are used in crime data mining. Although such organizations already maintain large databases, data mining helps to derive different types of information from them, such as the personal details of individuals together with vehicle details. Clustering techniques and association rule mining are applied to the different objects in crime records (such as persons, organizations and vehicles). Data mining not only detects but also analyzes crime data. The classification technique is also used to detect spam and to identify the person who sent the mail. A string comparator is used to detect deceptive information in criminal records.

The Data Mining System Implemented at the Internal Revenue Service

The data mining system implemented at the Internal Revenue Service to identify high-income individuals engaged in abusive tax shelters showed significantly good results. The major lines of investigation included visualization of the relationships and data mining to identify and rank possibly abusive tax avoidance transactions.

Data mining techniques can also be used effectively to enhance product quality. The data mining tool SAS/EM is used to discover previously unknown rules, which can improve product quality and reduce cost. A regression model and a neural network model applied for this purpose both gave accuracy above 80%, and the neural network model was found to perform better than the regression model.

E-commerce

E-commerce is also a highly promising domain for data mining. It is ideal because many of the ingredients required for successful data mining are easily available: data records are plentiful, electronic collection provides

reliable data, insight can easily be turned into action, and return on investment can be measured. The integration of e-commerce and data mining significantly improves the results and guides users in generating knowledge and making correct business decisions. This integration effectively solves several major problems associated with horizontal data mining tools, including the enormous effort required to pre-process the data before it can be mined and the difficulty of making the results of mining actionable.

The Digital Library

Data mining can be applied in the digital library, where users find, collect, store and preserve data in digital form. The advent of electronic resources and their increased use in libraries has brought about significant changes in libraries. Data and information are available in different formats, including text, images, video, audio, pictures and maps. The digital library is therefore a suitable domain for the application of data mining.

Prediction in Engineering Applications

Prediction in engineering applications has been treated effectively by a data mining approach. Prediction problems include the cost estimation problem in engineering and the problem of engineering design, which involves decisions where parameters, actions, components and so on are selected. Data mining techniques are applied to a variety of parameters in engineering applications, using prior data. Once the data has been gathered, different models and algorithms can be built to predict the characteristics of interest. A data mining algorithm applied to a test file with nine features produced 100% correct predictions. Several other applications have been studied in this context.

APPLICATIONS OF DATA MINING IN THE PHARMACEUTICAL INDUSTRY

Data mining is the process of extracting information from large data sets through the use of algorithms and techniques drawn from the fields of statistics, machine learning and database management systems (Feelders, Daniels and Holsheimer 2000). Traditional data analysis methods often involve manual work and interpretation of data that is slow, expensive and highly subjective (Fayyad et al 1996). Data mining, popularly known as knowledge discovery in large data, enables firms and organizations to make calculated decisions by assembling, accumulating, analyzing and accessing corporate data. It uses a variety of tools such as query and reporting tools, analytical processing tools and Decision Support System (DSS) tools.

Recently, numerous experts have predicted that revenue growth for the pharmaceutical industry will slow from the healthy 12% rate to a much slower 5-6% rate. Cosper Nate (2003) describes this trend, which is becoming increasingly accepted and has numerous implications for drug discovery technology companies. Most significantly, slower revenue growth will necessitate decreased expenditures throughout the organization. Many drug discovery technology companies are attempting to address this challenge by developing solutions that will force new drugs to "fail faster and safer." Although this is a noble goal and, if realized, would definitely create value for the industry, these solutions often overpromise and underestimate the obstacles that stand in the path to lower clinical failure rates. Marketing strategies centered on increasing revenues will be more convincing than those that address reducing expenses. Demonstrating that technologies will enable pharmaceutical companies to better target and market to certain customer segments will increase adoption of those technologies and will open the door for projects aimed at reducing costs and increasing clinical trial throughput.

The importance of decision support in the delivery of managed healthcare can hardly be overemphasized (Hampshire and Rosborough 1993). A variety of decision support capabilities will be necessary to increase the productivity of medical personnel, analyze care outcomes, and continually refine care delivery processes to remain profitable while holding the line on costs and maintaining quality of care (Dutta and Heda 2000). Healthcare decision support is faced with the challenges of complex and diverse data and knowledge forms and tasks (Prins and Stegwee 2000; Sheng 2000), the lack of standardized terminology compared to the basic sciences, stringent performance and accuracy requirements, and the prevalence of legacy systems (Sheng 2000).

Data mining enables the life sciences researcher to mine data to understand safety and efficacy profiles within the patient population. By tackling the question of patient selection within the framework of demonstrating which groups are most responsive, data mining is sure to penetrate the drug development marketplace. A data mining framework enables specialists to create customized nodes that can be shared throughout the organization, making the application attractive to skilled modelers in a pharmaceutical company's bioinformatics division.

Most healthcare institutions lack the appropriate information systems to produce reliable reports on information other than purely financial and volume-related statements (Prins and Stegwee 2000). Management in the pharma industry is starting to recognize the relevance of the definition of drugs and products in relation to management information. In the turmoil between costs, care results and patient satisfaction, the right balance is needed and can be found in emerging information and communication technology.
The delivery of healthcare has always been information intensive, and there are signs that the industry is recognizing the increasing importance of information processing in the new managed care environment (Morrisey,

). Most automated systems are used as a tool for daily work: they are focused on production (daily registration). All the operational data, that is, the data used to keep the organization running, are held in these automated systems, which are also called legacy systems. There is a growing need to do more with an organization's data than to use it for administration only. A lot of information is hidden in the legacy systems. This information can be extracted, although most of the time not directly from the legacy systems, because these were not built to answer unpredictable questions. Research shows (Zuckerman and Alan 2006; Armoni 2002; Rada 2002) that successful decision systems enriched with analytical solutions are necessary for healthcare information systems.

Given the size of the databases being queried, there is likely to be a trade-off between accuracy of information and processing time. Sampling techniques and tests of significance may be satisfactory to identify some of the more common relationships; however, uncommon relationships may require substantial search time. The thoroughness of the search depends on the importance of the query (e.g., life threatening vs. "curious to know"), the indexing structures used, and the level of detail supplied in the query. Of course, the real data mining challenge comes when the user supplies only a minimal amount of information. For example: find possible serious side effects (not necessarily reported in the manufacturer's product literature) involving food and any type or brand of antacid. A user interface may be designed to accept all kinds of information from the user (e.g., weight, sex, age, foods consumed, reactions reported, dosage, and length of usage). Then, based upon the information in the databases and the relevant data entered by the user, a list of warnings or known reactions (accompanied by

probabilities) should be reported. Note that user profiles can contain large amounts of information, and efficient and effective data mining tools need to be developed to probe the databases for relevant information. Secondly, the patient's (anonymous) profile should be recorded along with any adverse reactions reported by the patient, so that future correlations can be reported. Over time, the databases will become much larger, and interaction data for existing medicines will become more complete.

The amount of existing pharmaceutical information (pharmacological properties, dosages, contraindications, warnings, etc.) is enormous; however, this fact reflects the number of medicines on the market rather than an abundance of detailed information about each product. One of the major problems with pharmaceutical data is actually a lack of information. For example, the food and drug administration department has estimated that only about 1% of serious events are reported to it. Fear of litigation may be a contributing factor; however, most health care providers simply do not have the time to fill out reports of possible adverse drug reactions. Furthermore, it is expensive and time consuming for pharmaceutical companies to perform a thorough job of data collection, especially when most of the information is not required by law. Finally, one should note that the food and drug administration department does not require manufacturers to test new medicines for potential interactions.

There are in general four stages of drug development: discovery of new drugs; development, which tests and predicts drug behavior; clinical trials, which test the drug in humans; and commercialization, which takes the drug and sells it to likely consumers (doctors and patients).

Development of New Drugs

Research on new drugs needs data mining tools and techniques. Molecules can be clustered into groups according to their chemical properties via cluster analysis (Cooman 2005). In this way, every time a new molecule is discovered it can be grouped with other chemically similar molecules, which helps researchers find out which therapeutic group the new molecule belongs to. Mining can also help to measure the chemical activity of a molecule on a specific disease, say tuberculosis, and to find out which part of the molecule causes the action. A vast number of molecules can then be combined to form a super molecule that contains only the specific parts responsible for the action while inhibiting the other parts. This would greatly reduce the adverse effects associated with drug actions.

Scientists run experiments to determine the activity of potential drugs. They use high-speed screening to test tens, hundreds or thousands of drugs very quickly. The general goal is to find activity on relevant genes or to find drug compounds that have desirable characteristics (whatever those may be). The data mining techniques used in developing new drugs are clustering, classification and neural networks. The basic objective is to determine compounds with similar activity, because compounds with similar activity tend to behave similarly. This is possible only when we have a known compound and are looking for something better. When we do not have a known compound but have a desired activity and want to find a compound that exhibits it, data mining comes to the rescue.

Development Tests and Predicts Drug Behavior

There are many issues that affect the success of a marketed drug and that can affect its future development. Firstly

adverse reactions to the drugs are reported spontaneously and not in any organized manner. Secondly, we can only compare the adverse reactions with those of drugs from our own company, not with drugs from competing firms. Thirdly, we only have information on the patient taking the drug, not on the adverse reaction that the patient is suffering from. All this can be addressed by creating a data warehouse for drug reactions and running business intelligence tools on it; a basic classification tool can solve many of the problems faced here. We could find the adverse reactions associated with a specific drug and go a step further to show whether any specific condition, for example age, sex or obesity, aggravates the adverse reaction (Novartis Business Intelligence report, 2004). This could help medical practitioners describe the side effects to the patients being prescribed these drugs.

Pharma companies identify drugs that they think might yield some benefit. The drug undergoes testing in animals and human tissue to observe its effects and to determine how much of the drug must be consumed for the desired effect and how dangerous the drug is. The data mining techniques that can be used here are classification and neural networks. The goal is to predict whether a treatment will aid patients, because a drug that does not aid patients serves no purpose. Predicting drug behavior is essential when we have data supporting the use of the drug and also training data that shows its effects (positive or negative). The test should be able to predict which patients will benefit, for example which treatment will help sickle cell anemia patients. Information such as gender, body weight and disease state plays a crucial role. This data should be fed into a neural network to predict whether a patient will benefit from the drug. Only one of two classifications, yes or no, is available in the training data. The network is trained for the yes classifications and a snapshot is taken of the neural network. Then the network is trained for the no classifications and another snapshot is taken. The output is yes or no, depending on whether the inputs are more similar to the yes or the no training data.
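The yes/no decision described above, where the output depends on whether the inputs are more similar to the yes or the no training data, can be illustrated with a minimal sketch. The example swaps the neural network for a much simpler nearest-centroid rule, so it shows only the similarity idea and not the training procedure itself. The feature choice (body weight and a disease-state score), the function names and all values are hypothetical and serve only as an illustration.

```python
from math import dist
from statistics import fmean


def centroid(rows):
    """Mean feature vector of a list of equal-length numeric rows."""
    return [fmean(column) for column in zip(*rows)]


def train(records):
    """Summarize the yes and the no training data by one centroid each.

    records: list of (features, label) pairs, label being "yes" or "no".
    """
    yes_rows = [features for features, label in records if label == "yes"]
    no_rows = [features for features, label in records if label == "no"]
    return {"yes": centroid(yes_rows), "no": centroid(no_rows)}


def predict(model, features):
    """Return the label whose centroid is closest to the input features."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical training data: (body weight in kg, disease-state score 0-10),
# labelled with whether the patient benefited from the drug.
TRAINING = [
    ([60.0, 2.0], "yes"),
    ([65.0, 3.0], "yes"),
    ([70.0, 2.5], "yes"),
    ([90.0, 8.0], "no"),
    ([95.0, 7.5], "no"),
    ([85.0, 9.0], "no"),
]

model = train(TRAINING)
print(predict(model, [62.0, 2.2]))  # closer to the "yes" centroid
print(predict(model, [92.0, 8.5]))  # closer to the "no" centroid
```

In this sketch the features are used unscaled, so body weight dominates the distance; a real model would normally standardize the inputs first and would be validated on held-out patient data.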

Clinical Trials Test the Drug in Humans

The company tests the drug in actual patients on a larger scale and has to keep track of data about patient progress. Because the government wants to protect the health of citizens, many rules govern clinical trials. In developed countries the food and drug administration oversees trials. The data mining technique used here can be neural networks. The data is collected by the pharmaceutical company and undergoes statistical analysis to determine the success of the trial. Data is generally reported to the food and drug administration department and inspected closely. Too many negative reactions might indicate that the drug is too dangerous. An adverse event might be, for example, a medicine causing drowsiness. As a matter of fact, data mining at this stage is performed more by the food and drug administration than by pharmaceutical companies. The goal is to detect when too many adverse events occur or to detect a link between a drug and an adverse event. Too many adverse events linked to a drug might indicate that the drug is too dangerous or that the health of patients is at risk. Adverse events are reported to the food and drug administration when a link is suspected. Information on drugs causing many adverse events can be fed into a neural network, letting the network indicate what is meant by too many.

TECHNIQUES OF DATA MINING USED IN PHARMA SECTOR

Pharma industries rely on decision-oriented, systemic selection models that enable the decision maker to evaluate the payoff expected from the implementation of a proposed selection program. Such models go beyond an examination of the size of the validity coefficient and take into account a host of issues such as capital budgeting and strategic outcomes at the group and organizational levels. Many organizations generate mountains of data about their newly discovered drugs, their performance reports, etc. This

data is a strategic resource. Making the most of these strategic resources will improve the quality of pharma industries. The main techniques of data mining used in the pharma sector are clustering, classification and neural networks. Table 5.1 provides a clear understanding of how data mining techniques will really add value to pharma data.

Table 5.1 Major Data Mining Techniques and Patterns

Major data mining techniques: Clustering; Classification and prediction

Patterns:
Diseases having similar characteristics
Grouping top drugs
Analysis of workforce attrition in pharma sector
Groups of drugs most likely to be used, most unlikely to be used
Segments related to drugs, diseases
Predicting consumer behavior
Predicting the likelihood of success in a drug adoption process
Predicting the percentage accuracy in performance of a drug
Analyze forecast and model information to quantify human capital assets
Classifying the historical health records
Prediction of what type of drugs most likely to be retained, most likely to be left, most likely to transform their composition
Predicting pharma product behavior and attitude

Table 5.1 (Continued)

Major data mining techniques: Association, Mining Frequent Patterns; Data mining using other inter-disciplinary methods

Patterns:
Predicting demand projections by seasonal variations
Predicting the performance progress of segments throughout the performance period
Identifying the best profile for different drugs
Prediction to find what factors will attract new avenues in pharma sector
Classify trends of movements through the organization for successful/unsuccessful patient historical records
Categorization of drugs, diseases and patients
Association of training undertaken diseases with drugs
Association and analysis of staff movements
Application tracking mechanism in physicians adopting drugs with customer's prescription
Standardizing training methods, task performance monitoring, monitor workflow route tasks
Provides summary information: various multidimensional summary reports
Statistical summary information (data central tendency and variation)
Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
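The association patterns listed in Table 5.1, such as the association of diseases with drugs, are commonly expressed as rules scored by support and confidence. The sketch below computes both measures for a single rule over a handful of prescription records. The records, the item names and the function names are hypothetical and are assumptions made purely for illustration; they do not come from any real data set.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for record in records if items <= record)
    return hits / len(records)


def confidence(records, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the records."""
    both = support(records, set(antecedent) | set(consequent))
    base = support(records, antecedent)
    return both / base if base else 0.0


# Hypothetical records: each set holds one diagnosis and the drugs prescribed.
RECORDS = [
    {"hypertension", "drug_a"},
    {"hypertension", "drug_a", "drug_b"},
    {"hypertension", "drug_b"},
    {"diabetes", "drug_c"},
    {"diabetes", "drug_a"},
]

# Rule under test: hypertension -> drug_a
rule_support = support(RECORDS, {"hypertension", "drug_a"})
rule_confidence = confidence(RECORDS, {"hypertension"}, {"drug_a"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Support measures how often the disease and the drug appear together across all records, while confidence measures how often the drug appears among the records that already contain the disease. A frequent-pattern miner would evaluate many candidate rules in this way and keep those above chosen thresholds.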

Clustering

Clustering is a method by which similar records are grouped together. The term is usually used to mean segmentation. An organization can build a hierarchy of classes that group similar events. Using clustering, employees can be grouped based on income, age, occupation, housing, etc. In business, clustering helps to identify groups with similarities and to characterize customer groups based on purchasing patterns, etc.

Classification and Prediction

Classification and prediction are two data analysis techniques used to describe data classes and to predict future data. A credit card company whose customers' credit history is known can classify its customer records as Good, Medium or Poor. Similarly, the income levels of customers can be classified as High, Medium and Low. Decision tree models and neural-network-based classification schemes are very useful in the pharma industry. Classification works on discrete, unordered data, while prediction works on continuous data. Regression is often used, as it is a statistical method for numeric prediction. Primary emphasis should be placed on the measurement accuracy and predictive efficiency of the selection for any new drug discovery. Simple or multiple regression is the basic prediction model that enables a decision maker to forecast each criterion status based on predictor information. Smith and Gupta (2002) show through case studies how neural network technology is useful in different areas of business.

Associations, Mining Frequent Patterns

These methods identify rules of affinity among collections. Hand et al (2001) mention that patterns occur frequently during the Data Mining

process. The applications of association rules include market basket analysis, attached mailing in direct marketing, fraud detection, department store floor/shelf planning, etc.

DATA MINING IN THE PHARMA SECTOR: BENEFITS

The identification and quantification of pharmaceutical information can be extremely useful for patients, physicians, pharmacists, health organizations, insurance companies, regulatory agencies, investors, lawyers, pharmaceutical manufacturers, drug testing companies, etc. Through data mining techniques in the healthcare area, an easy-to-use decision support platform can be built that supports the decision-making process of care givers and clinical managers. This is a new direction in modern health care, as it opens an innovative application field for information technology in the healthcare industry. Medical institutions as well as health insurance companies are primarily interested in increasing the patient healing rate and reducing treatment costs. Application of data mining could prove economical in the long term by avoiding the duplication of examinations, saving time through automation of routine tasks, and simplifying accounting and administrative procedures. Data mining brings alerts at the point of patient care, supports clinical or business strategies, and provides standing or ad hoc reports for care givers (clinicians, physicians, nurses) and other decision-makers (clinical management, human resources). The contribution of this paper is to demonstrate that the data mining approach constitutes a reliable and efficient carrier of the clinical decision support system.

Pharmaceutical enterprises leverage information technology to exceed business targets. Research and development for discovering a new drug is a very costly affair, involving very high investments in human resources, the costs of clinical trials and a prolonged time factor. Also, out of many leads under trial, only one or two can actually be converted successfully

into viable drugs. Developing a drug thus requires tremendous amounts of data to be analyzed from various clinical trial platforms, covering a variety of patients under different disease conditions spread across the globe. The number of parameters to be analyzed is also larger than anywhere else. A compound has to be tested for various chemical, physical and biological parameters, such as storage conditions, shelf life, side effects and adverse effects, to name a few, in a variety of testing environments. Also, obtaining approval from regulatory agencies involves stricter parameters, with much less margin for error, than in any other industry, requiring powerful and error-free analysis of data.

Some of the areas where data mining can be applied in the pharmaceutical industry are:

Clinical data analysis: Clinical data analysis evaluates and streamlines large amounts of information. Data mining helps to reveal trends, irregularities and risks during product development and launch.

Marketing and sales analysis: Marketing and sales analysis enables the identification of the most profitable products and the allocation of marketing funds. Data mining here helps to examine consumer behavior in terms of prescription renewals and product purchases.

Customer analysis: Using data mining, one can develop more targeted customer profiles that focus not only on products but also on the ability to pay for them, by analyzing historical health trends in combination with demographics. Identify and target individuals and demographics that could be considered undiagnosed with educational campaigns whose goal is to encourage these individuals to get screened

and tested for possible issues. Combine product sales information with customer group and customer channel information to analyze what leads customers to fill prescriptions at a more consistent rate, or what leads physicians to prescribe certain drugs at a higher rate.

Operations and financial analysis: analyze the prescription activity in a geographic region or area to adjust the sales force according to market size or penetration. Dissect buying trends of the largest customers (managed care providers and governments) to proactively create price points that benefit both the buyer and the organization.

Sales and marketing analysis: provide mobile analytics to a sales force that is frequently disconnected, allowing it to answer not only detailed drug information questions but also historical and trending questions. Target physicians who have high prescription rates of a certain drug or treatment with information on new drugs that treat complementary symptoms or conditions.

Product analysis: analyze buying tendencies and treatment outcomes to create more drug and product variations tailored directly to different age groups and risk factors. Combine demographics and patient historical trends to target the quality-of-life needs of patients (i.e. lifestyle drugs) that improve day-to-day living standards, especially for non-acute medical conditions.

Supply chain analysis: improve production schedules through analysis of which pharma products stay on the

shelves the longest and how well each pharma product is selling. Manage inventories more efficiently based on historical trends and patient behavior to prevent stock-outs at retail and pharmacy locations.

There are four stages of drug development: finding the new drug through research; developing tests and predicting the drug's behavior; clinical trials, which test the drug in humans; and, finally, commercialization, which takes the drug and sells it to likely consumers (doctors and patients).

Discovery of New Drugs

The general goal is to find activity on relevant genes or to find drug compounds that have desirable characteristics (whatever those may be). The data mining techniques used here are clustering, classification and neural networks. The goal is to determine compounds with similar activity, because compounds with similar activity may behave similarly. This is appropriate when we have a known compound and are looking for something better, or when we have no known compound but have a desired activity and want to find a compound that exhibits it. It can be achieved by clustering the molecules into groups according to their chemical properties via cluster analysis (De Cooman 2005). This way, every time a new molecule is discovered it can be grouped with other chemically similar molecules, which helps researchers find out which therapeutic group the new molecule belongs to. Mining can also help to measure the chemical activity of a molecule against a specific disease, say tuberculosis, and to find out which part of the molecule causes the action. This way we can combine a vast number of molecules, forming a super molecule with only the specific part of the

molecule that is responsible for the action, while inhibiting the other parts. This would greatly reduce the adverse effects associated with drug actions.

Prediction of Drugs

Many issues affect the success of a marketed drug and can impact its future development. First, adverse reactions to drugs are reported spontaneously and not in any organized manner. Second, we can only compare adverse reactions among the drugs of our own company and not with drugs from competing firms. Third, we only have information on the patient taking the drug, not on the adverse reaction that the patient is suffering from. Data mining tools can address these issues; even a basic classification tool can solve many of the challenges. One can find the adverse reactions associated with a specific drug and go a step further to show whether any specific condition aggravates the adverse reaction. This could help medical practitioners describe the side effects to the patients being prescribed these drugs. At this stage, the pharma company believes that a drug may yield some benefit. The drug undergoes testing in animals and human tissue to observe its effects and to determine how much of the drug must be consumed for the desired effect and how dangerous the drug is. The data mining techniques that can be used here are classification and neural networks. The goal is to predict whether a treatment will aid patients, because a drug that does not aid patients serves no purpose. Predicting drug behavior is essential when we have data supporting use of the drug and also have training data that show its effects (positive or negative). The test should be able to predict which patients will benefit and which treatment will help sickle cell anemia patients. Information such as gender, body weight and disease state will play a crucial role.
These data should be fed into a neural network, which then predicts whether the patient will benefit from the drug.
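
The prediction step described above can be illustrated with a minimal sketch. It is not the method used in this work: the patient records are synthetic, the labelling rule is invented purely for illustration, and the network size, learning rate and all names (make_synthetic_patients, TinyNet, benefit_probability) are assumptions. The sketch trains a small feed-forward neural network with one hidden layer on three inputs (gender, standardised body weight, standardised disease-state score) and outputs the estimated probability that a patient benefits from the drug.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def make_synthetic_patients(n, rng):
    """Generate hypothetical patient records; NOT real clinical data."""
    rows, labels = [], []
    for _ in range(n):
        gender = float(rng.randint(0, 1))   # 0 or 1
        weight = rng.uniform(-1.0, 1.0)     # body weight, standardised
        severity = rng.uniform(-1.0, 1.0)   # disease-state score, standardised
        # Invented labelling rule, for illustration only.
        label = 1 if severity - 0.5 * weight > 0 else 0
        rows.append([gender, weight, severity])
        labels.append(label)
    return rows, labels


class TinyNet:
    """One hidden tanh layer and a sigmoid output unit."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        prob = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, prob

    def predict_proba(self, x):
        return self.forward(x)[1]

    def loss(self, rows, labels):
        # Mean binary cross-entropy over the data set.
        eps = 1e-12
        total = 0.0
        for x, y in zip(rows, labels):
            p = self.predict_proba(x)
            total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        return total / len(rows)

    def train(self, rows, labels, epochs=300, lr=0.3):
        n = len(rows)
        for _ in range(epochs):
            # Accumulate gradients over the full batch.
            g_w1 = [[0.0] * len(row) for row in self.w1]
            g_b1 = [0.0] * len(self.b1)
            g_w2 = [0.0] * len(self.w2)
            g_b2 = 0.0
            for x, y in zip(rows, labels):
                hidden, p = self.forward(x)
                d_out = p - y  # gradient w.r.t. the output pre-activation
                for j, h in enumerate(hidden):
                    g_w2[j] += d_out * h
                    d_hid = d_out * self.w2[j] * (1.0 - h * h)
                    g_b1[j] += d_hid
                    for i, xi in enumerate(x):
                        g_w1[j][i] += d_hid * xi
                g_b2 += d_out
            # Gradient-descent update with the averaged gradients.
            for j in range(len(self.w2)):
                self.w2[j] -= lr * g_w2[j] / n
                self.b1[j] -= lr * g_b1[j] / n
                for i in range(len(self.w1[j])):
                    self.w1[j][i] -= lr * g_w1[j][i] / n
            self.b2 -= lr * g_b2 / n


rng = random.Random(0)
rows, labels = make_synthetic_patients(60, rng)
net = TinyNet(n_in=3, n_hidden=4, rng=rng)

initial_loss = net.loss(rows, labels)
net.train(rows, labels)
final_loss = net.loss(rows, labels)

# Hypothetical new patient: gender 1, below-average weight, high severity.
benefit_probability = net.predict_proba([1.0, -0.4, 0.8])
print("loss before training:", round(initial_loss, 3))
print("loss after training: ", round(final_loss, 3))
print("estimated probability of benefit:", round(benefit_probability, 3))
```

With real data, the three inputs would be replaced by the recorded patient attributes, and the model would have to be evaluated on patients held out from training before any prediction could be trusted.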