An Evaluation of High-end Data Mining Tools for Fraud Detection

Dean W. Abbott, Elder Research, San Diego, CA
I. Philip Matkovsky, Federal Data Corporation, Bethesda, MD
John F. Elder IV, Ph.D., Elder Research, Charlottesville, VA

ABSTRACT

Data mining tools are used widely to solve real-world problems in engineering, science, and business. As the number of data mining software vendors increases, however, it has become more challenging to assess which of their rapidly updated tools are most effective for a given application. Such judgment is particularly valuable for the high-end products, due to the investment of money and time required to become proficient in their use. Reviews by objective testers are very useful in the selection process, but most published to date have provided somewhat limited critiques, and haven't uncovered the critical benefits and shortcomings that can probably only be discovered by using a tool for an extended period on real data. Here, five of the most highly acclaimed data mining tools are so compared on a fraud detection application, with descriptions of their distinctive strengths and weaknesses, and lessons learned by the authors during the process of evaluating the products.

1. INTRODUCTION

The data mining tool market has become more crowded in recent years, with more than 50 commercial data mining tools listed, for example, at the KDNuggets web site. Rapid introduction of new and upgraded tools is an exciting development, but it does create difficulties for potential purchasers trying to assess the capabilities of off-the-shelf tools. Dominant vendors have yet to emerge (though consolidation is widely forecast). According to DataQuest, in 1997 IBM was the data mining software market leader with a 15% share of license revenue, Information Discovery was second with 10%, Unica was third with 9%, and Silicon Graphics was fourth with 6% [1].
Several useful comparative reviews have appeared recently, including those in the popular press (e.g., Datamation [2]) and two detailed reports sold by consultants: the Aberdeen Group [3] and Two Crows [4]. Yet most reviews do not (and perhaps cannot) address many of the specific application considerations critical to many companies, and few show evidence of extensive practical experience with the tools.

1 This research was supported by the Defense Finance Accounting Service Contract N D-8055 under the direction of Lt. Cdr. Todd Friedlander, in a project initiated by Col. E. Hutchison.

This paper summarizes a recent extensive evaluation of high-end data mining tools for fraud detection. Section 2 outlines the tool selection process, which began with dozens of tools and ended with five examined in depth. Multiple categories of evaluation are outlined, including each tool's suitability to the intended users and computer environment, automation capabilities, types and quality of algorithms implemented, and ease of use. Lastly, each tool was extensively exercised on real data to assess its accuracy and its strengths in practice.

2. TOOL SELECTION

2.1 Computer System Environment

System requirements, including supported computer platforms, relational databases, and network topologies, are often particular to a company or project. For this evaluation, a client-server environment was desirable, due to the large data sets that were to be analyzed. The environment comprised a multi-processor Sun server running Solaris 2.x and client PCs. All computers ran on an Ethernet network using TCP/IP. Data typically resided in an Oracle database, though smaller tables could be exported to ASCII format and copied to the server or client.
2.2 Intended End-User

The analysts who will use the data mining tools were specified as non-experts; that is, they would have some knowledge of data mining, but would not be statisticians or experts in the mathematics underlying the algorithms. They are, however, domain experts. Therefore, the data mining tools had to use language a novice would understand, and provide guidance for the non-expert.

2.3 Selection Process

Thorough tool evaluation is time-intensive, so a two-stage selection phase preceded the in-depth evaluation. In the first stage, more than 40 data mining tools/vendors were rated on six qualities:
- product track record
- vendor viability
- breadth of data mining algorithms in the tool
- compatibility with a specific computer environment
- ease of use
- the ability to handle large data sets
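The staged screening arithmetic used here (each evaluator rates each tool per category, the ratings within a category are averaged, and the category averages are weighted and summed into one score per product) can be sketched as follows. The tool names, ratings, categories, and weights below are hypothetical illustrations, not the study's actual values.

```python
# Sketch of the first-stage screening arithmetic: per-category ratings
# from several evaluators are averaged, then the category averages are
# weighted and summed into a single score per product.
# All names, ratings, and weights are hypothetical.

def weighted_score(ratings_by_category, weights):
    """ratings_by_category maps category -> list of evaluator ratings."""
    total = 0.0
    for category, ratings in ratings_by_category.items():
        average = sum(ratings) / len(ratings)   # average across evaluators
        total += weights[category] * average    # weight and sum
    return total

weights = {"track record": 2.0, "ease of use": 1.5, "scalability": 1.0}

tools = {
    "Tool A": {"track record": [4, 5], "ease of use": [3, 4], "scalability": [5, 5]},
    "Tool B": {"track record": [5, 5], "ease of use": [2, 2], "scalability": [3, 4]},
}

scores = {name: weighted_score(r, weights) for name, r in tools.items()}
shortlist = sorted(scores, key=scores.get, reverse=True)  # best first
```

The same single-number scheme extends directly to the second stage: re-rate the surviving tools on the additional characteristics and re-rank.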
Several expert evaluators were asked to rate the apparent strength of each tool in each category, as judged from marketing material, reviews, and experience. The scores within each category were averaged, and the category scores were weighted and summed to create a single score for each product. The top 10 tools continued to the second stage of the selection phase.

The 10 remaining tools were further rated on several additional characteristics: experience in the fraud domain, quality of technical support, and ability to export models to other environments as source code or ASCII text. When possible, tools were demonstrated by expert users (usually vendor representatives), who answered detailed algorithmic and implementation questions. 2 The expert evaluators re-rated each tool characteristic, and the top five tools were selected for extensive hands-on evaluation. 3 They are listed, in alphabetical order of product, in Table 1.

Table 1: Data Mining Products Evaluated
Company                              Product                                Version
Integral Solutions, Ltd. (ISL) [5]   Clementine                             4.0
Thinking Machines (TMC) [6]          Darwin
SAS Institute [7]                    Enterprise Miner (EM)
IBM [8]                              Intelligent Miner for Data (IM)
Unica Technologies, Inc. [9]         Pattern Recognition Workbench (PRW)    Beta

3. PRODUCT EVALUATION

All five tools evaluated are top-notch products that can be used effectively for discovering patterns in data. They are well known in the data mining community, and have proven themselves in the marketplace. This paper helps distinguish the tools from one another, and outlines their strengths and weaknesses in the context of a fraud application. The properties evaluated included client-server compliance, automation capabilities, breadth of algorithms implemented, ease of use, and overall accuracy on fraud-detection test data.
3.1 Client-Server Processing

A primary difference between high-end and less expensive data mining products is scalability to large data sets. Data mining applications often use data sets far too large to be retained in physical RAM, slowing processing considerably as data is swapped to and from virtual memory. Algorithms also run far slower when dozens or hundreds of candidate inputs are considered in models. The client-server processing model therefore has great appeal: use a single high-powered workstation for processing, but let multiple analysts access the tools from PCs on their desks. Still, one's network bandwidth has a dramatic influence on which of these tools will operate well.

2 One vendor withdrew at the second stage to avoid such scrutiny.
3 One product that originally reached the final stage was found, on installation, to have stability flaws undetected in earlier stages.

Table 2 describes the characteristics of each tool as they relate to this project (other platforms supported by the data mining tool vendors are not listed). Note that some tools did not meet every system specification. For example, Intelligent Miner at the time of this evaluation did not support Solaris on the server, 4 Clementine had no native client, and PRW had no relational database connectivity and was not client-server.

Table 2: Software and Hardware Supported
Product            Server         Client    Oracle Connect
Clementine         Solaris 2.X    X         Server-side ODBC
Darwin             Solaris 2.X              Server-side ODBC
Enterprise Miner   Solaris 2.X 1            SAS Connect for Oracle
Intelligent Miner  IBM AIX                  IBM Data Joiner
PRW                (data only)              Client-side ODBC

1 Most testing performed on a standalone version.

The products implement client/server in a wide variety of ways. Darwin best implements the paradigm, its client requiring the least processing and network traffic.
We were able to use Darwin from a client accessing a server over 28.8 Kbaud modems without appreciable speed loss, because the client interface passed only command-line arguments to the server (and the data sets used for most of this stage were relatively small).

Clementine was tested as a standalone Unix application, without a native client. However, we used X-terminal emulation software on the PC (Microimage's X Server) to display its windows. Because the entire window has to be transmitted over the network, the network and processor requirements were much higher than for the other tools; performance over modem lines was unacceptably slow.

4 A Solaris version of Intelligent Miner for Data is due out in the 3rd quarter of 1998.
PRW was tested as a standalone application. Data is accessed via a file server or database on the network, but all processing takes place on the analyst's computer. Therefore, processor capabilities on the client (which here is also the server) must be significantly better than is required for the others.

Intelligent Miner for Data ran a Java client, allowing it to run on nearly any operating system. Java runs more slowly than other GUI designs, but this wasn't an appreciable problem in our testing.

Enterprise Miner was tested primarily as a standalone tool because the Solaris version was only released during the evaluation period. It has the largest disk footprint of any of the tools, at 300+ MB.

Note: for brevity, the products will be referred to henceforth primarily by the names of their vendors: IBM (Intelligent Miner for Data), ISL (Clementine), SAS (Enterprise Miner), TMC (Darwin), and Unica (PRW).

3.2 Automation and Project Documentation

The data mining process is iterative, with model building and testing repeated dozens of times [10]. The experimentation process involves repeatedly adjusting algorithm parameters, candidate inputs, and sample sets of the training data. It is a great help to automate what can be automated in this process, in order to free the analyst from some of the mundane and error-prone tasks of linking and documenting exploratory research findings. Here, documentation was judged successful if the analyst could reproduce results from the notes, cues, and saved files made available by the data mining tool. All five products provided means to document findings during the research process, including time and date stamps on models, text fields to hold notes about a particular model, and the saving of guiding parameters.

The visual-programming interfaces of ISL and SAS use an icon for each data mining operation (file input, data transformation, modeling algorithm, data analysis module, plot, etc.).
ISL's version is particularly impressive and easy to use, which makes understanding and documenting the steps taken during model creation very clear. ISL also provides a macro language, Clem, for advanced data manipulation, and an automated way to find a Neural Network architecture (the number of hidden layers and the number of nodes in each hidden layer).

TMC is run manually by the analyst via pull-down menus. At each step in an experiment, the options selected are recorded and retained (along with any free-text notes the analyst thinks to include), providing a record of guidance parameters, along with a date and time stamp.

Unica employs an experiment manager to control and automate model building and testing. There, multiple algorithms can be scheduled to run in batch on the same data. In addition, the analyst can specify a search over algorithm parameters or over the data fields used in modeling. Unica allows free-text notes, and can encode models as internal functions with which to test on new data.

IBM uses a wizard to specify each model. The Neural Network wizard, by default, automatically establishes the architecture, or the user can specify it manually.

3.3 Algorithms

Data mining tools containing multiple algorithms usually include Neural Networks, Decision Trees, and perhaps one other, such as Regression or Nearest Neighbor. Table 3 lists the algorithms implemented in the five tools evaluated. (Comments below focus on the distinctive features of each tool, but not all of its capabilities.)

Table 3: Algorithms Implemented (across IBM, ISL, SAS, TMC, and Unica)
- Decision Trees
- Neural Networks
- Regression
- Radial Basis Functions
- Nearest Neighbor
- Nearest Mean
- Kohonen Self-Organizing Maps
- Clustering
- Association Rules
1 accessed in data analysis only
2 estimation only (not for classification)

Three of the tools include logistic (and linear) Regression, which is an excellent baseline from which to compare the more complex non-linear methods.
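The batch-experimentation pattern that Unica's experiment manager embodies, with a simple baseline run alongside a less trivial candidate, can be sketched as below. The tiny data set and the two stand-in "models" (a majority-class predictor and a hypothetical one-feature rule) are illustrative only; a real tool would substitute trees, networks, or regressions.

```python
# Sketch of batch experimentation with a baseline: several candidate
# models are fit on the same training split and compared on held-out
# accuracy. The "models" here are deliberately trivial stand-ins.

def majority_class(train_y):
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority                # baseline: ignores the input

def one_rule(threshold):
    # hypothetical single-feature rule: predict class 1 when feature > threshold
    return lambda x: 1 if x > threshold else 0

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)

train_x, train_y = [1, 2, 3, 9], [0, 0, 0, 1]
test_x,  test_y  = [1, 9, 2, 8], [0, 1, 0, 1]

experiments = {
    "majority-class baseline": majority_class(train_y),
    "one-rule (threshold 5)": one_rule(5),
}
results = {name: accuracy(m, test_x, test_y) for name, m in experiments.items()}
```

Running every candidate against the same split, with the baseline in the batch, makes the gain from the more complex method explicit.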
Unica has implemented the most diverse set of algorithms (though it lacks the single most popular, Decision Trees), and includes an extensive set of options for each. Importantly, one can search over a range of parameters and candidate inputs automatically.
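Such an automated sweep over parameter values and candidate input fields can be sketched as a nested search that keeps whichever combination scores best on validation data. The scoring function below is a hypothetical stand-in for a real fit-and-evaluate step, and the field names and parameter values are invented for illustration.

```python
# Sketch of an automated search over algorithm parameters and candidate
# input fields: every (field-subset, parameter) combination is evaluated,
# and the best validation score wins. validation_score is a hypothetical
# stand-in for training a model and scoring it on held-out data.
from itertools import combinations

def validation_score(fields, learning_rate):
    # stand-in: pretend "amount" and "region" are informative, a moderate
    # learning rate works best, and extra fields cost a small penalty
    informative = len({"amount", "region"} & set(fields))
    return informative - abs(learning_rate - 0.1) - 0.01 * len(fields)

candidate_fields = ["amount", "region", "day_of_week"]
learning_rates = [0.01, 0.1, 0.5]

best = None
for size in range(1, len(candidate_fields) + 1):
    for fields in combinations(candidate_fields, size):
        for rate in learning_rates:
            score = validation_score(fields, rate)
            if best is None or score > best[0]:
                best = (score, fields, rate)

best_score, best_fields, best_rate = best
```

The search is exhaustive here for clarity; tools typically bound it with heuristics when the number of candidate inputs is large.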
SAS has the next most diverse set, and provides extensive controls over algorithm parameters. ISL implements many algorithms and has a good set of controls. TMC offers three algorithms with good controls, but does not yet have unsupervised learning (clustering). IBM has only two mainstream classification algorithms (though Radial Basis Functions are available for estimation), and their controls are minimal. However, IBM also includes a Sequential Pattern (time series) discovery tool.

3.3.1 Decision Tree Options

Table 4 compares the options available for the Decision Tree algorithms. "Advanced pruning" refers to the use of cross-validation or a complexity penalty to prune trees.

Table 4: Options for Decision Trees (compared for IBM, ISL, SAS, and TMC)
- Handles Real-Valued Data
- Costs for Misclassification
- Assign Priors to Classes
- Costs for Fields
- Multiple Splits
- Advanced Pruning Options
- Graphical Display of Trees

All of the tools that implement trees allow one to prune the trees manually after training, if desired. TMC, in fact, does not select a best tree, but has the analyst identify the portion of the full tree to be implemented. TMC also does not provide a graphical view of the tree. IBM provides only limited options for tree generation. Unica does not offer Decision Trees at all.

3.3.2 Neural Network Options

ISL and Unica provide a clear and simple way to search over multiple Neural Network architectures to find the best model. That capability is also in IBM. Surprisingly, none of the tools adjust for differing costs among classes, and only Unica allows different prior probabilities to be taken into account. All but IBM have advanced learning options and employ cross-validation to govern when to stop. Table 5 summarizes these properties.

Table 5: Options for Neural Networks (compared for IBM, ISL, SAS, TMC, and Unica)
- Automatic Architecture Selection
- Advanced Learning Algorithms
- Assign Priors to Classes
- Costs for Misclassification
- Cross-Validation Stopping Rule

3.4 Ease of Use

The four categories by which ease of use was evaluated are listed in Table 6. Each category score is the average of multiple specific sub-categories scored independently by five to six users, essentially evenly split between data mining experts and novices. The overall usability score is a weighted average of these four components, with model understanding weighted twice that of the others because of its importance for this application.

Table 6: Ease of Use Comparison (categories scored for IBM, ISL, SAS, TMC, and Unica)
- Data Load and Manipulation
- Model Building
- Model Understanding
- Technical Support
- Overall Usability

3.4.1 Loading Data

ISL, Unica, and IBM allow ASCII data to reside anywhere. For TMC, it must be within a project directory on the Unix server. SAS must first load data into a SAS table and store it in a SAS library directory. ISL, Unica, and SAS automatically read field labels from the first line of a data file and determine the data type of each column. Unica and TMC can save data types (if changes have been made from the default) to a file for reuse. For TMC and IBM, the user must specify each field in the data set manually, either in a file (TMC) or a dialog (IBM).
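The automatic field typing that ISL, Unica, and SAS perform (reading labels from the first line of the file and inferring each column's type from the values beneath it) can be sketched as follows; the small ASCII sample and its field names are hypothetical.

```python
# Sketch of automatic field typing from an ASCII file: the first line
# supplies the field labels, and each column's type is inferred by
# trying progressively looser types over its values.
import csv
import io

def infer_type(values):
    for candidate in (int, float):
        try:
            for v in values:
                candidate(v)                 # fails if any value doesn't parse
            return candidate.__name__
        except ValueError:
            continue
    return "str"

sample = io.StringIO(
    "txn_id,amount,region\n"
    "1001,250.00,EAST\n"
    "1002,13.95,WEST\n"
)
rows = list(csv.reader(sample))
labels, data = rows[0], rows[1:]
types = {
    label: infer_type([row[i] for row in data])
    for i, label in enumerate(labels)
}
```

Saving such an inferred schema to a file for reuse, as Unica and TMC do, then amounts to serializing the resulting mapping.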
3.4.2 Transforming Data

All five tools provide means to transform data, including row operations (randomizing, splitting into training and testing sets, sub-sampling) and column operations (merging fields, creating new fields). TMC has built-in functions for many such operations. ISL and Unica use custom macro languages for field operations, though ISL has several built-in column operations as well. SAS has extensive data cleaning options, including a graphical outlier filter.

3.4.3 Specifying Models

ISL and SAS specify models by editing a node in the icon stream. TMC and IBM employ dialog boxes, and Unica uses the experiment manager. The last is more flexible, but takes a bit longer to learn.

3.4.4 Reviewing Trees

IBM and SAS have graphical tree browsers, and IBM's is particularly informative. Each tree node is represented by a pie chart containing class distributions for the parent node and the current split, showing clearly the improved purity from the split. The numerical results are displayed as well, and trees can be pruned by clicking on nodes to collapse sub-trees. SAS uses color to indicate the purity of nodes, with text inside each node. ISL and TMC represent trees as text-based rules. ISL can collapse trees via clicking; TMC allows one to select subtrees for rule display.

3.4.5 Reviewing Classification Results

ISL, Unica, and IBM automatically generate confusion matrices (a cross-table of true vs. predicted classes). TMC, though, first requires one to merge together the prediction and target (true) vectors. SAS is currently the least capable in this area, as one must export the vectors to another program to view such a table.

3.4.6 Support

All five tools had excellent on-line help. Three also provided printed documentation (TMC and SAS did not). All also supplied phone technical support, with Unica and ISL delivering the most timely and comprehensive technical help.
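The confusion matrices that ISL, Unica, and IBM generate automatically are, at bottom, a nested tally of (true, predicted) label pairs; a minimal sketch, with hypothetical label vectors:

```python
# Sketch of a confusion matrix: a cross-table of true vs. predicted
# classes, tallied from two label vectors. The vectors are hypothetical.
from collections import Counter

def confusion_matrix(true, pred, classes):
    counts = Counter(zip(true, pred))
    return {t: {p: counts[(t, p)] for p in classes} for t in classes}

true = ["fraud", "legit", "legit", "fraud", "legit", "legit"]
pred = ["fraud", "legit", "fraud", "legit", "legit", "legit"]

cm = confusion_matrix(true, pred, ["fraud", "legit"])
# cm["fraud"]["legit"] counts missed frauds;
# cm["legit"]["fraud"] counts false alarms.
```

In the fraud setting the two off-diagonal cells are precisely the quantities traded off in Section 3.5: missed frauds and false alarms.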
3.5 Accuracy

The data used to grade the accuracy of the tools contained fraudulent and non-fraudulent financial transactions. The goal given the data mining algorithms was to find as many fraudulent transactions as possible without incurring too many false alarms (transactions flagged as fraudulent by the data mining tools, but in fact legitimate). Ascertaining the optimum tradeoff between these two quantities is essential if a single grade is to result. Yet interviewing fraud domain experts provided only general guidelines, not a rule for making such a tradeoff. Therefore, the ability of each algorithm to adjust to variable misclassification costs is important. Instead of building a single model with each tool, multiple models were generated; the resulting range defined a curve trading off the number of fraudulent transactions caught against the number of false alarms.

Half of the transaction data was used to create models, and half was reserved for an independent test of model performance (the evaluation data). To avoid compromising the independence of the evaluation data set, it was not used to gain insight into model structures, to help determine when to stop training Neural Networks or Decision Trees, or to validate the models. Approximately 20 models were created with each tool, including at least one from most of the available algorithms. Results are shown here only for Neural Networks and Decision Trees, because they allowed the best cross-comparison, and because they generally proved better than the other models. The best accuracy results obtained on the evaluation data set are shown below. Figure 1 displays the count of false alarms obtained by each tool using Neural Networks and Decision Trees. (PRW, as noted, cannot yet build trees.) Figure 2 displays the count of fraudulent transactions identified. (While smaller is better in Figure 1, larger is better in Figure 2.)
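The tradeoff curve described above can be traced, for a single scored model, by sweeping a decision threshold over the model's fraud scores and counting frauds caught against false alarms at each setting; the scores, labels, and thresholds below are hypothetical.

```python
# Sketch of the frauds-caught vs. false-alarms tradeoff: sweep a decision
# threshold over a model's fraud scores and tally both quantities at each
# operating point. Scores and labels are hypothetical.

def tradeoff_curve(scores, labels, thresholds):
    """Return (threshold, frauds_caught, false_alarms) triples."""
    curve = []
    for t in thresholds:
        flagged = [s >= t for s in scores]
        caught = sum(f and y == 1 for f, y in zip(flagged, labels))
        false_alarms = sum(f and y == 0 for f, y in zip(flagged, labels))
        curve.append((t, caught, false_alarms))
    return curve

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # model's fraud scores
labels = [1,   1,   0,   1,   0,   0]     # 1 = fraud, 0 = legitimate

curve = tradeoff_curve(scores, labels, [0.5, 0.2])
# Lowering the threshold catches more frauds but raises false alarms.
```

In the study, the curve was traced with distinct models rather than thresholds of one model, but the tradeoff being measured is the same.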
[Figure 1: Accuracy Comparison: False Alarms (smaller is better). Bar chart of false-alarm counts for Neural Networks and Decision Trees across Darwin, Clementine, PRW, Enterprise Miner, and Intelligent Miner.]

Note that Decision Trees were better than Neural Networks at reducing false alarms. This is probably due primarily to two factors. First, most of the trees allowed one to specify misclassification costs, so non-fraudulent transactions could explicitly be given a higher cost, reducing the number falsely flagged. Second, the pruning options for the trees were somewhat better developed than the stopping rules for the networks, so the hazard of overfit was smaller. (Note that, in other
applications, we have often found the exact opposite in performance. Accuracy evaluations, to be done right, must use data very close to the end application.)

[Figure 2: Accuracy Comparison: Fraudulent Transactions Caught (larger is better). Bar chart of frauds-caught counts for Neural Networks and Decision Trees across Darwin, Clementine, PRW, Enterprise Miner, and Intelligent Miner.]

4. LESSONS LEARNED

The expense of purchasing and learning to use high-end tools compels one first to define clearly the intended environment: the amount and type of data to be analyzed, the level of expertise of the analysts, and the computing system on which the tools will run.

4.1 Define Implementation Requirements
- User experience level: Will novices be creating models or only using results? How much technical support will be needed?
- Computer environment: Specify hardware platform, operating system, databases, server/client, etc.
- Nature of data: Size; location of users (bandwidth needed). Is there a target variable? (Is learning supervised or unsupervised?)
- Manner of deployment: Can the models be run from within the tool? Will they be deployed in a database (SQL commands) or a simulation (source code)?

4.2 Test in Your Environment Using Your Data

Surprises abound. Lab-only testing might miss critical positive and negative features. (Indeed, we learned more than we anticipated at each stage of scrutiny.)

4.3 Obtain Training

Data mining tools have improved significantly in usability, but most include functionality difficult for a novice user to use effectively. For example, several tools have a macro language with which one can manipulate data more effectively, or automate processing (e.g., Clementine, PRW, Darwin, and Enterprise Miner). If significant time will be spent with the tool, or if a fast turnaround is necessary, training should reduce errors and make the modeling process more efficient.

4.4 Be Alert for Product Upgrades

The data mining tool industry is changing rapidly.
Even during our evaluation, three vendors introduced versions that run under Solaris 2.x (SAS's Enterprise Miner, IBM's Intelligent Miner, and Unica's Model 1). While new data mining algorithms are rare, advances in their support environments are rapid. Practitioners have much to look forward to.

5. CONCLUSIONS

The five products evaluated here all display excellent properties, but each may be best suited to a different environment. IBM's Intelligent Miner for Data has the advantage of being the current market leader, with a strong vendor offering well-regarded consulting support. ISL's Clementine excels in support provided and in ease of use (given Unix familiarity), and might allow the most modeling iterations under a tight deadline. SAS's Enterprise Miner would especially enhance a statistical environment where users are familiar with SAS and could exploit its macros. Thinking Machines' Darwin is best when network bandwidth is at a premium (say, on very large databases). And Unica's Pattern Recognition Workbench is a strong choice when it is not obvious which algorithm will be most appropriate, or when analysts are more familiar with spreadsheets than Unix.

6. REFERENCES
[1] Data Mining News, Volume 1, No. 18, May 11.
[2] Datamation, datamine/stories/unearths.htm
[3] Hill, D. and Moran, R., Enterprise Data Mining Buying Guide: 1997 Edition, Aberdeen Group, Inc.
[4] Two Crows, Data Mining '98.
[5] Integral Solutions, Ltd.
[6] Thinking Machines Corp.
[7] SAS Institute
[8] IBM
[9] Unica Technologies, Inc.
[10] Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P., "From Data Mining to Knowledge Discovery: An Overview." In Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.
Version 14.0. Overview. Business value
PRODUCT SHEET CA Datacom Server CA Datacom Server Version 14.0 CA Datacom Server provides web applications and other distributed applications with open access to CA Datacom /DB Version 14.0 data by providing
A Business Intelligence Training Document Using the Walton College Enterprise Systems Platform and Teradata University Network Tools Abstract
A Business Intelligence Training Document Using the Walton College Enterprise Systems Platform and Teradata University Network Tools Jeffrey M. Stewart College of Business University of Cincinnati [email protected]
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
In-Database Analytics
Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing
IBM SPSS Modeler 14.2 In-Database Mining Guide
IBM SPSS Modeler 14.2 In-Database Mining Guide Note: Before using this information and the product it supports, read the general information under Notices on p. 197. This edition applies to IBM SPSS Modeler
The Prophecy-Prototype of Prediction modeling tool
The Prophecy-Prototype of Prediction modeling tool Ms. Ashwini Dalvi 1, Ms. Dhvni K.Shah 2, Ms. Rujul B.Desai 3, Ms. Shraddha M.Vora 4, Mr. Vaibhav G.Tailor 5 Department of Information Technology, Mumbai
ORACLE OPS CENTER: PROVISIONING AND PATCH AUTOMATION PACK
ORACLE OPS CENTER: PROVISIONING AND PATCH AUTOMATION PACK KEY FEATURES PROVISION FROM BARE- METAL TO PRODUCTION QUICKLY AND EFFICIENTLY Controlled discovery with active control of your hardware Automatically
DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Hurwitz ValuePoint: Predixion
Predixion VICTORY INDEX CHALLENGER Marcia Kaufman COO and Principal Analyst Daniel Kirsch Principal Analyst The Hurwitz Victory Index Report Predixion is one of 10 advanced analytics vendors included in
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
Modern Payment Fraud Prevention at Big Data Scale
This whitepaper discusses Feedzai s machine learning and behavioral profiling capabilities for payment fraud prevention. These capabilities allow modern fraud systems to move from broad segment-based scoring
Foundations of Business Intelligence: Databases and Information Management
Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of
Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD
Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,
IBM SPSS Modeler Server 16 Administration and Performance Guide
IBM SPSS Modeler Server 16 Administration and Performance Guide Note Before using this information and the product it supports, read the information in Notices on page 67. Product Information This edition
Agent vs. Agent-less auditing
Centennial Discovery Agent vs. Agent-less auditing Building fast, efficient & dynamic audits As network discovery solutions have evolved over recent years, two distinct approaches have emerged: using client-based
CA Virtual Assurance/ Systems Performance for IM r12 DACHSUG 2011
CA Virtual Assurance/ Systems Performance for IM r12 DACHSUG 2011 Happy Birthday Spectrum! On this day, exactly 20 years ago (4/15/1991) Spectrum was officially considered meant - 2 CA Virtual Assurance
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
ORACLE DATABASE 10G ENTERPRISE EDITION
ORACLE DATABASE 10G ENTERPRISE EDITION OVERVIEW Oracle Database 10g Enterprise Edition is ideal for enterprises that ENTERPRISE EDITION For enterprises of any size For databases up to 8 Exabytes in size.
Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.
Bonus Chapter Ten Major Predictive Analytics Vendors In This Chapter Angoss FICO IBM RapidMiner Revolution Analytics Salford Systems SAP SAS StatSoft, Inc. TIBCO This chapter highlights ten of the major
FOXBORO. I/A Series SOFTWARE Product Specifications. I/A Series Intelligent SCADA SCADA Platform PSS 21S-2M1 B3 OVERVIEW
I/A Series SOFTWARE Product Specifications Logo I/A Series Intelligent SCADA SCADA Platform PSS 21S-2M1 B3 The I/A Series Intelligent SCADA Platform takes the traditional SCADA Master Station to a new
ANALYTICS CENTER LEARNING PROGRAM
Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals
SPSS: Getting Started. For Windows
For Windows Updated: August 2012 Table of Contents Section 1: Overview... 3 1.1 Introduction to SPSS Tutorials... 3 1.2 Introduction to SPSS... 3 1.3 Overview of SPSS for Windows... 3 Section 2: Entering
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
BMC Remedy vs. IBM Control Desk. How to choose between BMC Remedy and IBM Control Desk December 2014
BMC Remedy vs. IBM Control Desk How to choose between BMC Remedy and IBM Control Desk December 2014 Version: 1.0 Date: 21/12/2014 Document Description Title BMC Remedy vs. IBM Control Desk Version 1.0
20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns
20 A Visualization Framework For Discovering Prepaid Mobile Subscriber Usage Patterns John Aogon and Patrick J. Ogao Telecommunications operators in developing countries are faced with a problem of knowing
IBM SPSS Modeler Premium
IBM SPSS Modeler Premium Improve model accuracy with structured and unstructured data, entity analytics and social network analysis Highlights Solve business problems faster with analytical techniques
CART 6.0 Feature Matrix
CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window
Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI
Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: [email protected] Data Mining a step in A KDD Process Data mining:
CS590D: Data Mining Chris Clifton
CS590D: Data Mining Chris Clifton March 10, 2004 Data Mining Process Reminder: Midterm tonight, 19:00-20:30, CS G066. Open book/notes. Thanks to Laura Squier, SPSS for some of the material used How to
CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data mining efforts.
CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data mining efforts. As a methodology, it includes descriptions of the typical phases
Computer and Information Sciences
Computer and Information Sciences Dr. John S. Eickmeyer, Chairperson Computers are no longer huge machines hidden away in protected rooms and accessible to only a few highly-trained individuals. Instead,
OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP
Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key
Numerical Algorithms Group
Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful
Software: Systems and. Application Software. Software and Hardware. Types of Software. Software can represent 75% or more of the total cost of an IS.
C H A P T E R 4 Software: Systems and Application Software Software and Hardware Software can represent 75% or more of the total cost of an IS. Less costly hdwr. More complex sftwr. Expensive developers
Welcome To Paragon 3.0
Welcome To Paragon 3.0 Paragon MLS is the next generation of web-based services designed by FNIS specifically for agents, brokers, and MLS administrators. Paragon MLS is an amazingly flexible online system
Base One's Rich Client Architecture
Base One's Rich Client Architecture Base One provides a unique approach for developing Internet-enabled applications, combining both efficiency and ease of programming through its "Rich Client" architecture.
Enhancing Compliance with Predictive Analytics
Enhancing Compliance with Predictive Analytics FTA 2007 Revenue Estimation and Research Conference Reid Linn Tennessee Department of Revenue [email protected] Sifting through a Gold Mine of Tax Data
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
VERITAS NetBackup BusinesServer
VERITAS NetBackup BusinesServer A Scalable Backup Solution for UNIX or Heterogeneous Workgroups V E R I T A S W H I T E P A P E R Table of Contents Overview...................................................................................1
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
The Predictive Data Mining Revolution in Scorecards:
January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms
imc FAMOS 6.3 visualization signal analysis data processing test reporting Comprehensive data analysis and documentation imc productive testing
imc FAMOS 6.3 visualization signal analysis data processing test reporting Comprehensive data analysis and documentation imc productive testing imc FAMOS ensures fast results Comprehensive data processing
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Local Area Networks: Software and Support Systems
Local Area Networks: Software and Support Systems Chapter 8 Learning Objectives After reading this chapter, you should be able to: Identify the main functions of operating systems and network operating
Tivoli Monitoring for Databases: Microsoft SQL Server Agent
Tivoli Monitoring for Databases: Microsoft SQL Server Agent Version 6.2.0 User s Guide SC32-9452-01 Tivoli Monitoring for Databases: Microsoft SQL Server Agent Version 6.2.0 User s Guide SC32-9452-01
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Software: Systems and Application Software
Software: Systems and Application Software Computer Software Operating System Popular Operating Systems Language Translators Utility Programs Applications Programs Types of Application Software Personal
Example 3: Predictive Data Mining and Deployment for a Continuous Output Variable
Página 1 de 6 Example 3: Predictive Data Mining and Deployment for a Continuous Output Variable STATISTICA Data Miner includes a complete deployment engine with various options for deploying solutions
