SAS Does Data Science: How to Succeed in a Data Science Competition

Patrick Hall, SAS Institute Inc.

ABSTRACT

First introduced in 2013, the Cloudera Data Science Challenge is a rigorous competition in which candidates must provide a solution to a real-world big data problem that surpasses a benchmark specified by some of the world's elite data scientists. The Cloudera Data Science Challenge 2 (in 2014) involved detecting anomalies in the United States Medicare insurance system. Finding anomalous patients, procedures, providers, and regions in the competition's large, complex, and intertwined data sets required industrial-strength tools for data wrangling and machine learning. This paper shows how I did it with SAS.

INTRODUCTION

The objective of the Cloudera Data Science Challenge 2 was to uncover anomalous patients, procedures, providers, and regions in the United States government's Medicare health insurance system (Cloudera 2014). I approached the discovery of such abnormal patients, procedures, providers, and regions in the Challenge data by using many different techniques, techniques that validated and augmented one another whenever possible. I also used several different types of data visualization to explore the Challenge data and assess my results.

This paper first summarizes the problems that were specified and the data that were supplied by the Challenge sponsors at Cloudera. Then it outlines the techniques and technologies that I used to complete the Challenge, followed by sections that describe in greater detail the approaches I used for data preprocessing and for completing the Challenge deliverables. Results are also discussed for each part of the Challenge, and the paper concludes with some brief recommendations for future work. Supplemental materials, including solution source code and sample data, are available for download.

SUMMARIES OF PROBLEM, DATA, METHODS, AND TECHNOLOGIES

PROBLEM SUMMARY

The Challenge was divided into the following three parts, each of which had specific requirements that pertained to identifying anomalous entities in different aspects of the Medicare system:

Part 1: Identify providers that overcharge for certain procedures or regions where procedures are too expensive.

Part 2: Identify the three providers that are least similar to other providers and the three regions that are least similar to other regions.

Part 3: Identify 10,000 Medicare patients who are involved in anomalous activities.

The Challenge rules mandated that solutions for each part be packaged into specific deliverables and submitted to Cloudera by a specified deadline.
DATA SUMMARY

Completing the different parts of the Challenge required using several data sources that have varying formats. My solutions for Parts 1 and 2 were based on 2011 financial summary data that were made available by the Centers for Medicare and Medicaid Services (CMS) in both comma-separated value (CSV) format and Microsoft Excel format: an inpatient financial summary table and an outpatient financial summary table.

Part 3 of the Challenge required preprocessing and analysis of large XML tables that contain patient demographic information and large ASCII-delimited text (ADT) files that contain patient-procedure transaction information. Because of medical record privacy regulations, the data for Part 3 were simulated by the Challenge sponsors at Cloudera and are not publicly available.

The supplemental materials provided with this paper include the summary CMS data in CSV format. The original patient demographics XML file and the patient-procedure transactional ADT files are not included; instead, preprocessed and sampled demographic and transactional SAS data sets are provided.

METHODS SUMMARY

Table 1 shows the wide variety of data preprocessing, analysis, and visualization techniques that I applied to complete the three parts of the Challenge.

Contest Part   Analytical Techniques                 Visualization Techniques
1              Descriptive statistics;               Box plots; pie charts
               straightforward data manipulation
2              Clustering; deep neural networks;     Scatter plots
               Euclidean distances; linear
               regression
3              Association analysis; clustering;     Graph representations;
               matrix factorization                  constellation plots

Table 1. Analytical and Visualization Techniques Used for Each Part of the Challenge

TECHNOLOGIES SUMMARY

Although size was not the most significant difficulty presented by the Challenge data, the patient data were large enough to require special consideration. Moreover, efficiently producing the requested deliverables for each part of the Challenge required the appropriate use of software tools and hardware platforms. I used disk-enabled, multithreaded software tools coupled with a solid-state drive (SSD) on a single machine for data preprocessing, analysis, and visualization in the first two parts of the Challenge. For Part 3 of the Challenge, I used both the same single-machine platform and a distributed platform in which data were allocated to numerous compute nodes.
The following list summarizes the technology that I used:

- Computing platforms:
  o Single machine: 24-core blade server with 128 GB of RAM and a 300 GB SSD
  o Distributed: 24-node Teradata database appliance
- Source code management: Git
- Data preprocessing: bash scripting and Base SAS on the single-machine platform
- Part 1: Base SAS and SAS/GRAPH on the single-machine platform
- Part 2: Base SAS, SAS/STAT, and SAS Enterprise Miner on the single-machine platform
- Part 3: Base SAS and SAS Enterprise Miner on the single-machine platform; SAS High-Performance Data Mining and SAS High-Performance Text Mining on the distributed platform

DATA PREPROCESSING

I downloaded the summary CMS data in CSV format and imported them into SAS format by using SAS DATA step and macro programming techniques. The PNTSDUMP.xml file that was provided by Cloudera contains numerous tables of simulated patient demographic data. I split this large XML file into separate tables by using the bash utilities grep, head, sed, and tail. I then imported each table into SAS by using the XML LIBNAME engine.

I imported the numerous simulated, transactional ADT files provided by Cloudera using a brute-force approach: a SAS DATA step read each character of the ADT files, caching lines and tokenizing them by using the respective ASCII record and unit delimiters. Importing a single file took no more than a few minutes in all cases. All imported Challenge data were then validated by using conventional techniques such as building frequency tables and analyzing missing and extreme values. The ccp-ds.sas program in this paper's supplemental materials includes the %get_summary_data macro that converts the summary CMS data to SAS tables.
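The following DATA step is a minimal sketch of that brute-force ADT import, not the exact code in ccp-ds.sas. The file name and the three-field record layout (patient ID, procedure code, claim amount) are hypothetical; the actual files contained more fields.

   filename adt 'patient_transactions.adt';        /* hypothetical file name */

   data work.transactions (keep=patient_id proc_code claim_amt);
      length buffer rec $1000 patient_id $12 proc_code $8;
      retain buffer;
      infile adt recfm=f lrecl=1;                  /* read the file one byte at a time */
      input byte $char1.;
      if byte = '1E'x then do;                     /* ASCII record separator: emit one row */
         if pos > 0 then do;
            rec        = substr(buffer, 1, pos);
            patient_id = scan(rec, 1, '1F'x);      /* ASCII unit separator splits fields */
            proc_code  = scan(rec, 2, '1F'x);
            claim_amt  = input(scan(rec, 3, '1F'x), best12.);
            output;
         end;
         pos = 0;
      end;
      else do;                                     /* cache characters until the record ends */
         pos + 1;
         substr(buffer, pos, 1) = byte;
      end;
   run;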
PART ONE: IDENTIFY PROVIDERS AND REGIONS WHERE COSTS ARE HIGH

METHODS

I used DATA step programming, macro programming, and the SORT procedure in Base SAS to manipulate the CMS summary data. I also used the MEANS and UNIVARIATE procedures in Base SAS to calculate descriptive statistics from the same summary data. The ccp-ds.sas file contains the complete solution SAS code for Part 1.

TESTING AND VALIDATION

I used DATA step programming to implement a simple checksum scheme to validate data manipulations. The CMS summary data contained 130 unique medical procedure codes. My code counted the distinct levels of the medical procedure code in tables that were built from several sorts and joins, always ensuring that each table retained all 130 distinct levels.
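As an illustration only, and assuming hypothetical table and column names (work.charges, proc_code, avg_charge), the checksum and the relative-variation ranking used in Part 1A might be sketched as follows:

   %macro check_codes(table);
      /* checksum: a derived table must still contain all 130 procedure codes */
      proc sql noprint;
         select count(distinct proc_code) into :n_codes trimmed
         from &table;
      quit;
      %if &n_codes ne 130 %then
         %put ERROR: &table has &n_codes procedure codes instead of 130.;
   %mend check_codes;

   %check_codes(work.charges_joined);   /* hypothetical derived table */

   /* rank procedures by the coefficient of variation of their average charges */
   proc means data=work.charges noprint;
      class proc_code;
      var avg_charge;
      output out=work.cv_stats (where=(_type_=1)) cv=charge_cv;
   run;

   proc sort data=work.cv_stats;
      by descending charge_cv;
   run;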
RESULTS

Part 1A: Highest Cost Variation

The three medical procedures that had the most widely varying cost to the patient (whether or not the procedure was expensive to begin with) are Level I Excisions and Biopsies, Level I Hospital Clinic Visits, and Level II Eye Tests and Treatments. The results in Figure 1 indicate that some providers charge extremely large amounts for certain medical procedures, despite each procedure code being associated with a set level of severity and each procedure having relatively low mean and median costs.

Figure 1. Claimed Charges for the Three Procedures That Have the Highest Coefficient of Variation (Relative Variation)

Notable high-cost outliers include Centinela and Whittier Hospital Medical Centers (both in Los Angeles, CA), which charge an average of more than $20,000 for Level I Excisions and Biopsies. Lower Bucks Hospital of Philadelphia, PA, charges an average of $9,780 for Level I Hospital Clinic Visits, and Ronald Reagan UCLA Medical Center charges an average of $4,187 for Level II Eye Tests and Treatments. Further research should be undertaken to understand whether the identified medical procedures have genuinely variable costs and whether some legitimate or illegitimate relationship exists between the abnormally high-cost procedures delivered in Los Angeles, CA.

Parts 1B and 1C: Highest-Cost Claims by Provider and Region

The three providers that claim the highest charges for the largest number of procedures are Bayonne Hospital Center of Bayonne, NJ; Crozer Chester Medical Center of Philadelphia, PA; and Stanford Hospital of Stanford, CA. Although it is logical that a large, well-respected hospital such as Stanford would account for a substantial number of the highest-cost procedures, it is unclear why lower-profile providers such as the others noted in Figure 2a should account for such a large proportion of the highest-cost procedures.

The three regions in which patients are charged the highest amounts for the largest number of medical procedures are Contra Costa County, San Mateo County, and the Santa Cruz region, all in the San Francisco Bay Area of California. Although such geographical clustering could be representative of fraud, these findings combined with the findings in Part 1A are more likely indicative of the high cost of living in these areas. Because nine regions in California account for 80% of the highest-cost procedures, research into the high cost of health care in that state could result in considerable savings for the Medicare system. Figure 2b illustrates the large proportion of highest-cost procedures that are performed in California.

Figure 2a. Percentage of Procedures for Which the Noted Provider Charges the Highest Amount

Figure 2b. Percentage of Procedures for Which the Noted Region Charges the Highest Amount

Part 1D: Highest Number of Procedures and Largest Differences between Claims and Reimbursements

The three providers that have the highest number of procedures with the largest difference between their claimed charges to patients and their reimbursement from Medicare are Bayonne Hospital Center in Bayonne, NJ, and Crozer Chester Medical Center and Hahnemann University Hospital, both in Philadelphia, PA. Disproportionate differences between claimed charges and Medicare reimbursement might also indicate the high cost of living in the suburbs of major East Coast cities, where these providers are located. However, these providers might warrant more detailed research because they are outside the previously discovered anomalously high-cost regions of California and they all account for a disproportionate number of the absolute highest-cost procedures that were identified in Part 1B.

PART TWO: IDENTIFY THE LEAST SIMILAR PROVIDERS AND REGIONS

METHODS

Before conducting outlier analyses, I augmented the information in the original numeric features by using DATA step programming and macro programming to engineer new numeric features from the provided text data. I generated binary indicators to flag providers as being a university hospital and to flag regions as containing a university hospital. I created interval features for the number of medical procedures of each level that were performed by a provider and in a region: outpatient level information was extracted from the procedure codes, and new levels were assigned to inpatient procedures based on the presence of chronic conditions. I used the CORR procedure in SAS/STAT to measure the Pearson correlation between all the originally provided features and the new engineered features. One feature from each of a small number of highly correlated feature pairs was excluded from further analyses to eliminate redundancy.
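As a minimal sketch of this kind of feature engineering, assuming hypothetical input columns provider_name and proc_desc, the step below derives a university-hospital indicator and a procedure-level feature and then screens the results for redundancy:

   data work.provider_features;
      set work.providers;                       /* hypothetical input table */
      length level_word $8;
      /* binary indicator: the provider name suggests a university hospital */
      univ_hosp = (find(upcase(provider_name), 'UNIVERSITY') > 0);
      /* interval feature: Roman-numeral level parsed from the outpatient
         procedure description, e.g., 'Level II Eye Tests and Treatments' */
      if prxmatch('/LEVEL (I{1,3}|IV|V)\b/', upcase(proc_desc)) then do;
         level_word = prxchange('s/.*LEVEL (I{1,3}|IV|V)\b.*/$1/', 1,
                                upcase(proc_desc));
         select (level_word);
            when ('I')   proc_level = 1;
            when ('II')  proc_level = 2;
            when ('III') proc_level = 3;
            when ('IV')  proc_level = 4;
            when ('V')   proc_level = 5;
            otherwise    proc_level = .;
         end;
      end;
   run;

   /* screen original and engineered features for redundant, highly
      correlated pairs before the outlier analyses */
   proc corr data=work.provider_features pearson;
      var univ_hosp proc_level claimed_total reimbursed_total;  /* hypothetical */
   run;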
I combined two distance-based unsupervised learning approaches to identify the points that were most different from all other points in both the provider and the region summary data. I first calculated the entire Euclidean distance matrix of a particular feature space by using the DISTANCE procedure in SAS/STAT. Treating the mean of each feature as the origin of that space, I identified the points that were farthest from the origin as potentially the least similar points in the summary data. To supplement these findings, I used the FASTCLUS procedure in SAS/STAT to apply k-means clustering to the same feature space, and I used the aligned box criterion (ABC) to estimate the best number of clusters for that feature space (Hall et al. 2013). The aligned box criterion is available through the NOC option of the HPCLUS procedure in SAS Enterprise Miner. The clustering results enabled me to pinpoint the farthest Euclidean distance outliers that also formed their own cluster far from other clusters. To visualize the combined results of both approaches, I used the NEURAL procedure in SAS Enterprise Miner to implement a type of deep neural network, known as a stacked denoising autoencoder, to accurately project the newly labeled points from the particular feature space into a two-dimensional space (Hinton and Salakhutdinov 2006).

Because the farthest Euclidean distance outliers are high-profile, well-respected providers, they seemed uninteresting from a fraud detection perspective. I therefore used the REG procedure in SAS/STAT to perform traditional regression outlier analysis on the provided summary data to identify more subtle anomalous points. The ccp-ds.sas file contains the complete solution SAS code for Part 2.

TESTING AND VALIDATION

I used k-means clustering, with the aligned box criterion (ABC) estimating the best number of clusters, to validate the findings from the direct calculation of the full Euclidean distance matrix. Then I used a deep neural network to project the combined results into a two-dimensional space for further exploration and validation. Regression outlier analysis also found the identified Euclidean distance outliers to be leverage points, whereas the regression outliers that had large studentized residuals were sometimes found to be points that resided at the edges of clusters in the cluster analysis.
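A condensed sketch of the two distance-based steps follows. It reuses the hypothetical feature table from the previous sketch and hardcodes the number of clusters, whereas the actual analysis estimated it with the aligned box criterion:

   /* full Euclidean distance matrix of the feature space */
   proc distance data=work.provider_features out=work.dist method=euclid;
      var interval(claimed_total reimbursed_total proc_level univ_hosp);
      id provider_id;
   run;

   /* distance of each provider from the origin (the vector of feature means) */
   proc stdize data=work.provider_features out=work.centered method=mean;
      var claimed_total reimbursed_total proc_level univ_hosp;
   run;

   data work.origin_dist;
      set work.centered;
      dist_from_origin = euclid(claimed_total, reimbursed_total,
                                proc_level, univ_hosp);
   run;

   proc sort data=work.origin_dist;
      by descending dist_from_origin;
   run;

   /* k-means on the same feature space; k=12 is a placeholder for the
      number of clusters estimated by the aligned box criterion */
   proc fastclus data=work.provider_features maxclusters=12 out=work.clusters;
      var claimed_total reimbursed_total proc_level univ_hosp;
   run;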
RESULTS

Part 2A: Providers Least Like Others

The three providers that are least like all others are the Cleveland Clinic of Cleveland, OH; UCSF Medical Center of San Francisco, CA; and the Lahey Clinic Hospital of Burlington, MA. Figure 3a is an optimal, nonlinear two-dimensional projection of the provider feature space, which contains Medicare billing information, the number and severity of procedures billed, and an indicator of whether the provider is a university hospital. Because the new axes were generated by a large neural network in which the input features are recombined many times over, they are quite difficult to interpret. In both the initial feature space and the two-dimensional projection, the Cleveland Clinic, UCSF Medical Center, and Lahey Clinic Hospital reside in their own clusters that are far from the origin of the space and far from all other clusters. That these points were placed in their own clusters by a k-means method that uses ABC to estimate the best number of clusters is especially significant because the k-means method prefers spherical clusters of similar size. In short, there is statistical support for the hypothesis that these providers are unique.

I also performed a more traditional regression outlier analysis with the hope of identifying lower-profile providers that are different in less obvious ways. I used residuals and leverage points from a regression of providers' claims against providers' reimbursements and numbers of billed procedures to locate providers that charge disproportionately high prices for the amount of Medicare reimbursement they receive and the number of procedures they bill. These potentially anomalous providers are Bayonne Hospital Center (Bayonne, NJ), Doctors Hospital of Manteca (Manteca, CA), and Delaware County Memorial Hospital (Drexel Hill, PA). Figure 3b presents these providers as points that have high studentized residual values and low leverage values.

Figure 3a. Provider Clusters Projected into Two Dimensions with Labeled Euclidean Distance Outliers

Figure 3b. Providers Plotted by Studentized Residual and Leverage with Labeled Extreme Outliers
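The regression outlier analysis can be sketched as follows, again with hypothetical column names; the residual cutoff is a conventional choice rather than the exact threshold used in the analysis:

   /* regress claims on reimbursement and procedure volume, and capture
      influence diagnostics for each provider */
   proc reg data=work.provider_features plots(label)=rstudentbyleverage;
      model claimed_total = reimbursed_total n_procedures;
      output out=work.reg_diag rstudent=rstud h=leverage;
   run;
   quit;

   /* flag providers with large studentized residuals */
   data work.reg_outliers;
      set work.reg_diag;
      if abs(rstud) > 3;
   run;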
Part 2B: Regions Least Like Others

Following the same logic and approaches as in Part 2A, I found the following regions to be the most different from all other regions: Boston, MA; Cleveland, OH; and Los Angeles, CA. The most anomalous regression outliers are Palm Springs, CA; Hudson, FL; and Tyler, TX.

PART THREE: IDENTIFY PATIENTS INVOLVED IN ANOMALOUS ACTIVITIES

METHODS

I used matrix factorization followed by cluster analysis, along with a priori association analysis, to compare Medicare patients who had previously been selected for manual review with patients who had not. I then aggregated a fraud score for previously unselected patients from both the cluster analysis and the association analysis to determine which 10,000 patients were the most anomalous.

I used DATA step programming to transform the patient transaction data into a sparse matrix stored in coordinate list (COO) format. Then I used the HPTMINE procedure in SAS High-Performance Text Mining to decompose that sparse matrix directly from the COO representation into 10 singular value decomposition (SVD) features. The HPDMDB procedure was used to encode nominal data about patients into numeric features. The SVD features were merged with the numeric encodings of the patient demographic data, and the HPCLUS procedure was used to create 1,000 k-means clusters. Patients who had not previously been selected for manual review but who fell in clusters where a high proportion of the other patients had been selected were given a nonzero preliminary potential fraud score: the higher the proportion of previously selected patients in a cluster, the higher the score assigned to the previously unselected patients in that cluster.
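A minimal sketch of that cluster-based scoring, assuming a hypothetical table work.clusters that contains a 0/1 flagged indicator, follows:

   /* preliminary fraud score: the proportion of previously flagged patients
      in each cluster, assigned to that cluster's unflagged patients */
   proc sql;
      create table work.cluster_scores as
      select a.patient_id,
             a.cluster,
             b.flag_rate as cluster_score
      from work.clusters as a
           inner join
           (select cluster, mean(flagged) as flag_rate
            from work.clusters
            group by cluster) as b
      on a.cluster = b.cluster
      where a.flagged = 0;
   quit;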
The DMDB procedure and the ASSOC procedure in SAS Enterprise Miner were used for a priori association analysis to identify frequent sets of medical procedures among the general patient population and among the patients flagged for manual review. Sets of medical procedures that were frequent within the flagged patient group but infrequent in the general population were assumed to be evidence of anomalous behavior. An unlabeled patient's potential fraud score was incremented for each anomalous set of transactions he or she participated in.

To create a final ranking of the 10,000 most suspicious patients, the potential fraud scores from both the cluster analysis and the association analysis were combined with approximately equal weighting, and the patients who had the highest overall scores were submitted for additional review. The ccp-ds.sas file contains the complete solution SAS code for Part 3. The ccp-ds.xml and ccp-ds.spk files contain the diagram and model package necessary to recreate the a priori association analysis in SAS Enterprise Miner. The ccp-ds.sas program, ccp-ds.xml diagram, and ccp-ds.spk model package use SAS tables containing demographic and transactional data that were sampled from the preprocessed contest data sets.

TESTING AND VALIDATION

Patients who were identified as potentially anomalous by both the cluster analysis and the association analysis were the most likely to be submitted for additional review. I profiled the cluster results and found that the clusters that contained the highest proportions of manually flagged patients were homogeneous. I used the Association node in SAS Enterprise Miner to generate constellation plots of the frequent transactions in the general patient population and among the patients flagged for manual review, and I found the two graphs to be conspicuously dissimilar.
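The final ranking step can be sketched with the hypothetical score tables from the previous sketches; the equal 0.5 weights stand in for the approximately equal weighting described above:

   /* combine the cluster-based and association-based scores and keep
      the 10,000 highest-scoring patients */
   proc sql outobs=10000;
      create table work.most_suspicious as
      select c.patient_id,
             0.5*c.cluster_score + 0.5*a.assoc_score as fraud_score
      from work.cluster_scores as c
           inner join work.assoc_scores as a     /* hypothetical score table */
           on c.patient_id = a.patient_id
      order by calculated fraud_score desc;
   quit;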
RESULTS

The six patient clusters that had the highest proportions of patients flagged for manual review (and were therefore the six most suspicious patient clusters) were found to be homogeneous groups that were composed primarily of higher-income females in a common age range. Several dozen additional clusters of anomalous patients were identified, and these exhibited varying demographic characteristics.

The frequent transactions of the general patient population indicate that most patients received one of a handful of very common procedures plus a small number of less frequent procedures, likely reflecting a pattern of receiving one of many routine procedures followed by a less common follow-up procedure. Manually flagged patients, on the other hand, often received a series of many different procedures, a pattern that could be used to identify possibly fraudulent behavior in the future. Figures 4a and 4b compare the constellation plots that represent undirected graphs of the normal patients' transactions and the flagged patients' transactions, respectively. In both figures, larger node and link sizes and brighter node and link colors represent increasing frequency.

Figure 4a. Frequent Transactions in the General Patient Population

Figure 4b. Frequent Transactions in the Patient Population Flagged for Manual Review
SUMMARY AND RECOMMENDATIONS

This Cloudera Data Science Challenge 2 submission revealed a number of anomalous entities in the Medicare system. However, the Challenge occurred over a specified time period of approximately three months, allowing for only a limited amount of analytical sophistication and validation. Given more time and resources, additional techniques and technologies could yield better insights.

One class of techniques that could be applied in future analyses is adaptive learning algorithms, which update their results incrementally as new data are introduced. These methods would enable a better understanding of anomalous entities over time, instead of in a static snapshot like the results presented in this paper.

In-database analytics is a technology that can be used to very quickly turn analytical findings into operational decisions. Applied to Medicare anomalies, in-database analytics could enable the automatic identification of abnormal behavior, followed by a response that is tailored to limit the costs of the detected behavior.

Probably the most important path for future work is collaboration with medical or insurance professionals. Several definitions of data science have highlighted the importance of domain expertise in data analyses (Conway 2013 and Press 2013). The application of medical or insurance domain expertise would certainly augment the findings presented here.

REFERENCES

Cloudera, Inc. 2014. "Data Science Challenge 2." Accessed January 12.

Conway, D. 2013. "The Data Science Venn Diagram." Accessed January 12.

Hall, P., Kaynar-Kabul, I., Sarle, W., and Silva, J. 2013. "Number of Clusters Estimation." U.S. patent application.

Hinton, G. E., and Salakhutdinov, R. R. 2006. "Reducing the Dimensionality of Data with Neural Networks." Science 313:504-507.

Press, G. 2013. "A Very Short History of Data Science." Accessed January 12.

ACKNOWLEDGMENTS

I thank Anne Baxter, Senior Technical Editor in the Advanced Analytics Division at SAS, for her valuable input to this paper.

CONTACT INFORMATION

Patrick Hall
[email protected]
SAS Enterprise Miner
100 SAS Campus Dr.
Cary, NC

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.