Paper 003-29

Text Mining Warranty and Call Center Data: Early Warning for Product Quality Awareness

John Wallace, Business Researchers, Inc., San Francisco, CA
Tracy Cermack, American Honda Motor Co., Inc., Torrance, CA

ABSTRACT

An early warning system was implemented that leverages SAS/Text Miner and SAS/QC. While analysts already used comment fields to further understand automotive problems under investigation, an early warning system was tasked to monitor incoming free-form text data in a systematic fashion. Clustering models serve to group similar warranty claims together. Once the automotive analyst has reviewed the cluster definition, all new data about that particular auto model will be scored into the best cluster. Cluster sizes and variances are monitored weekly, and deviations from expected values are flagged for human review.

INTRODUCTION

This paper discusses the process of creating text-based clustering models and monitoring the clusters for change over time. Using textual data to define a unit of analysis offers a different view into data that has already received significant attention.

TERMINOLOGY

Some of the terms used within this paper are:

Document: One record of text generated from a phone call or warranty claim.
Document Collection: All of the documents that belong to a particular database, auto model, and model year.
Modelset: The document collection used to build a clustering model.
Scoreset: The newly collected documents that will be scored and added to the document collection.

DATA SOURCES

The text field of several different databases is collected for analysis. Each of the sources differs in the vocabulary used and the types of issues discussed.

1. Warranty: when dealers complete warranty service claims, a comment field is available to further describe the problem.
2. Customer Relations: the call center logs parts of conversations and written communications with customers.
3. Techline: calls from dealer service technicians to specialized mechanics create more text data.

There is no requirement on the text's format in each database; any combination of characters is allowed.

WORKFLOW

The Early Warning Process begins by collecting the data for a modelset (see Figure 1). Modelsets included all of the records (a minimum of three months) in the database about a specific car. The statistical models were separated by the three types of data listed above, as well as by car model and model year. Once the clustering model has completed the process described below, the model is put into production and begins to score new data. On a weekly basis, new data is scored and assigned to the best cluster. The week's assignments are aggregated and analyzed using SAS/QC. Any departure from the expected aggregate values is flagged, and an alert is included in the weekly alert email.
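The weekly score-aggregate-alert loop can be sketched in a few lines. This is a hypothetical illustration, not the paper's SAS implementation: the function name `weekly_run`, the relative-deviation threshold, and the toy scoring function are all assumptions made for the sketch.

```python
# Hypothetical weekly batch: score new documents into clusters, aggregate
# the week's assignments, and flag clusters whose share of total records
# deviates from the expected share (the paper uses SAS/QC p-charts for this).
from collections import Counter

def weekly_run(scoreset, score_fn, expected_share, threshold=0.5):
    """score_fn maps a document to a cluster id; flag any cluster whose
    weekly share differs from its expected share by more than `threshold`
    (relative)."""
    assignments = Counter(score_fn(doc) for doc in scoreset)
    total = sum(assignments.values())
    alerts = []
    for cluster, expected in expected_share.items():
        share = assignments.get(cluster, 0) / total
        if abs(share - expected) / expected > threshold:
            alerts.append(cluster)
    return alerts

# Toy scoreset: "bumper" claims spike to 60% against an expected 30%.
docs = ["bumper crack"] * 6 + ["engine noise"] * 4
score_fn = lambda d: "bumper" if "bumper" in d else "other"
print(weekly_run(docs, score_fn, {"bumper": 0.3, "other": 0.7}))
```

In production the thresholds come from control limits computed on historical weeks, not a fixed constant; the sketch only shows where the aggregation and comparison sit in the pipeline.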

Figure 1. Early Warning Process Flow (Data Collection, Text Mining, Model Review, Score Data Collection, Statistical Process Control, Alerts/Reporting)

CLUSTERING PROCESS

Figure 2 shows the flow of documents through the text mining process. The step "Term by Document Matrix" is where words are converted to numbers (see Getting Started with SAS Text Miner for more detail). The size and sparseness of the term-by-document matrix is driven by the size and complexity of the document collection.

Figure 2. Text Miner Process Flow (Parsing, Stemming, Synonyms, StopList, Term by Document Matrix, Singular Value Decomposition, Clustering)

The inputs to the clustering algorithm are the outputs from the Singular Value Decomposition: the SVD document vectors. The number of vectors needed to best approximate that matrix typically ranges from 10 to 100. In Text Miner, the "resolution" parameter decides how many of the vectors to use for clustering. In this project, up to 70 dimensions were passed to the clustering algorithm as input. A space with so many dimensions is very empty; the observations are pushed to the edges of the projected space. In essence, the clustering algorithm is identifying intersections on the edges where there are groups of observations. Figure 3 depicts what this space may look like. Clusters of varying sizes and varying distances from one another inhabit the edges of the input space.
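The path from raw text to SVD document vectors can be illustrated with a minimal numpy sketch. This stands in for Text Miner's internals, which are not public; the toy documents and the choice of k are assumptions for illustration only.

```python
import numpy as np

# Toy document collection (the real data is dealer and call-center text).
docs = [
    "bumper cracked bumper replaced",
    "headliner sagging headliner replaced",
    "bumper scratched repaired",
]

# Term-by-document frequency matrix: one row per term, one column per document.
vocab = sorted({w for d in docs for w in d.split()})
tdm = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: keep k singular vectors. The projected columns are the
# "SVD document vectors" that become the clustering inputs.
k = 2
U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dimensional vector per document

print(doc_vectors.shape)  # (3, 2)
```

The paper's models kept up to 70 of these dimensions; here k=2 just makes the shapes easy to see.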

Figure 3. Clusters on the Outer Edges of the Input Space

Some experimentation during early development, with both k-means (PROC FASTCLUS) and Expectation-Maximization (PROC EMCLUS) clustering, led to the selection of Expectation-Maximization clustering for all models. The decision was based on the quality of the models as defined in the Application section below. Once the statistical modeler has built an initial model, the modeler presents the clusters to the domain expert: the auto analyst. Lastly, after the auto analyst makes any changes, the final clustering model becomes the baseline knowledge about all of the free-form text for that car.

APPLICATION

The most general objective of clustering is to form groups of similar records. The objective of grouping documents together in a meaningful fashion can only be met when the auto analyst finds the clusters useful. In this case, the usefulness of a cluster is subjective but focuses on the homogeneity of the documents within it. The domain expert assesses the homogeneity of a cluster by how well that cluster describes a particular engineering problem. For example, if every document in a cluster refers to the headliner, it is more homogeneous than a hypothetical cluster with documents about both the bumper and the headliner. Since a document collection may have hundreds of different topics discussed in a one-year period, tens or hundreds of clusters may be necessary in order to achieve homogeneity among documents in a cluster. It was recognized early on that the best initial clustering model would not meet the auto analysts' expectations. In fact, it was determined that a series of models would be required to satisfy their needs. It was considered a remarkable success if the analyst was pleased with the clusters containing 75% of the documents in the modelset. Based on the auto analyst's domain expertise, clustering models were tuned for homogeneity using two techniques: subclustering and merging clusters.
Subclustering can be defined as using the documents assigned to a cluster in a first model as the modelset for a subsequent model. The subcluster model has a much smaller term-by-document matrix, and therefore allows documents that were once viewed as similar to be grouped separately. In the case of a catch-all cluster, subclustering was necessary because the parent cluster lacked homogeneity, as in the bumper-and-headliner example above. In other cases, a cluster that was clearly about leaks was broken into different types: oil, water, etc. Lastly, if the model identified a cluster about cars with a dead battery and another cluster about cars being jump-started, it was easy to merge the two clusters together outside of Text Miner.
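Both tuning moves, merging two clusters and reclustering inside one, can be sketched on label arrays. This is an illustration only: the paper used Expectation-Maximization clustering (PROC EMCLUS), while the minimal k-means below is a stand-in, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal k-means as a stand-in for the paper's EM clustering."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Fake SVD document vectors: three separated groups of documents.
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in ((0, 0), (3, 0), (0, 3))])
labels = kmeans(X, 3)

# Merging: two clusters judged to describe the same issue (e.g. dead battery
# and jump-started) are relabeled as one, outside the clustering tool.
merged = np.where(labels == 2, 1, labels)

# Subclustering: re-run clustering using only the documents of one parent
# cluster as the new modelset (here, the largest cluster).
parent_id = np.bincount(labels).argmax()
parent = X[labels == parent_id]
sublabels = kmeans(parent, 2)
```

The key property the paper relies on is visible in the last step: the subcluster model sees a smaller vocabulary, so documents the parent model treated as similar can separate.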

TEXT MINING UTILITIES

At first glance, text mining seems to contradict the adage that data mining is 80% data preparation and 20% modeling: the raw text itself is the input to the algorithms. However, two other important inputs end up taking considerable time to develop: the synonym list and the stop list. Just as in data mining the fitting of lines to the data is arguably less important than understanding the problem and creating useful input variables, in text mining creating the synonym and stop lists is more important than the selection of clustering algorithms. These lists are the heart (or brains) of any Text Miner model. Due to the number of abbreviations and auto-specific terms, this project's synonym list has grown to over 30,000 entries. During the course of developing the first models, we opted to build some utilities to extend the functionality of Text Miner using Base SAS and SAS Macro. Several of these extensions were related to synonym and stop list creation. A partial list of the utilities developed follows:

1. Data Preprocessor. Particularly dirty data may require the removal of unwanted characters using rules other than those employed by Text Miner. The portion of the macro shown below addresses the conditional removal of a slash and removes incomplete records. The slash is used to create the word A/C but may also be used to join two separate words, as in repaired/installed. The conditional removal is necessary, as A/C cannot become A C; but if repaired/installed is left alone, it means neither repaired nor installed, but a new third term to Text Miner. The algorithm in this macro works in this order:

   a. Count up to the first three words in the string.
   b. If there are fewer than three, delete the record.
   c. Look for a slash at the beginning and end of the string and remove it.
   d. From the beginning of the string, identify the first occurrence of a slash.
   e. Count the number of characters in front of the slash.
   f. If three or more characters, reconstruct the string with a space in place of the slash.
   g. If the string preceding the slash is shorter (i.e. A/C), skip it and continue searching in the rest of the string.

   %macro RemoveUnwantedCharacters(DsIn=, TextField=);
      /* ...code removed here... */
      data &DsIn (drop=location location2 count i wordcount);
         set &DsIn;
         /* a-b: count up to three words; drop records with fewer than three */
         wordcount=0;
         do i=1 to 3;
            if scan(&TextField, i, " ") ne "" then wordcount+1;
         end;
         if wordcount <= 2 then delete;
         /* c: strip a slash at the end or the beginning of the string */
         if substr(reverse(&TextField),1,1)="/" then
            &TextField=reverse(substr(reverse(&TextField),2));
         if substr(&TextField,1,1)="/" then &TextField=substr(&TextField,2);
         /* d-g: walk through the remaining slashes */
         count=1;
         location=index(&TextField,"/");
         if location > 0 then do until (location2=0);
            if length(scan(reverse(scan(&TextField, count, "/")),1," ")) > 2 then
               /* f: three or more characters precede the slash - replace it */
               &TextField=trim(substr(&TextField, 1, location-1))!!" "!!
                          trim(substr(&TextField, location+1));
            else
               /* g: short prefix such as A/C - leave it and move on */
               count+1;
            location2=index(substr(&TextField,location+1),"/");
            location=location+location2;
         end;
      run;
   %mend RemoveUnwantedCharacters;
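The conditional slash rule translates compactly to other languages. Here is a Python sketch of the same three rules (short-record deletion, edge-slash stripping, prefix-length test); the function name and regex are illustrative, not part of the paper's code.

```python
import re

def clean_comment(text):
    """Sketch of the preprocessor's rules: drop records with fewer than
    three words, strip slashes at either end, and replace a slash only
    when three or more word characters precede it, so that
    'repaired/installed' splits into two words while 'A/C' survives."""
    if len(text.split()) < 3:        # a-b: fewer than three words
        return None
    text = text.strip("/")           # c: slashes at the string's edges
    # d-g: replace 'word1/word2' with 'word1 word2' only for long prefixes
    return re.sub(r"(\w{3,})/", r"\1 ", text)

print(clean_comment("bumper repaired/installed at dealer"))
# -> bumper repaired installed at dealer
print(clean_comment("A/C blows warm air"))
# -> A/C blows warm air (unchanged)
```

The regex version checks only the characters immediately before each slash, which matches the macro's intent: the decision is made per slash, so a record can contain both A/C and a joined word pair.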

2. Dataset Extractor. This routine extracts term lists from Text Miner into Excel for human review. On the way into Excel, these lists are processed by other SAS macros not detailed here.

3. History Recorder. This step maintains a master term list so that only new words (i.e. words that have not been seen before) in new documents are reviewed.

4. Text Helper. The human decision process of adding terms to the synonym and/or stop list is facilitated with Excel-based (Visual Basic) macros.

5. Synonym Integrity Checker. A macro analyzes the integrity of the synonym list in order to remove inconsistencies (i.e. a term that has multiple parents) that can have an unpredictable impact on Text Miner. Any exceptions are presented to the operator for correction, and corrections are integrated into the synonym and stop lists. These checks are required often because the synonym and stop lists are updated each time a new model is built.

6. Nuisance Record Suppressor. Adding a term to the stop list removes the term from cluster definition. However, sometimes it is meaningful to remove the whole document. Whatever types of records were suppressed in the modeling stage also need to be suppressed during the scoring process.

7. Model Reporter. Having the top terms for every cluster and subcluster in a file became very useful when reviewing models with the domain experts.

SCORING

The main facility for scoring provided in Text Miner on SAS 8.2 is scoring from the Enterprise Miner interface. However, all of the necessary score code to score new data in batch is available in the score node. Some of the modeling decisions, such as subclustering and suppressing observations, called for careful creation of Enterprise Miner diagrams and several enhancements to the Text Miner score code. After some experimentation, there was success in gathering the score code from a diagram with multiple models and scoring data outside of the EM interface.
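The integrity check in utility 5 reduces to finding terms with more than one parent. A minimal sketch, with a hypothetical synonym list made up for the example:

```python
# Hypothetical synonym list as (term, parent) pairs; real lists in this
# project exceed 30,000 entries of auto-specific terms and abbreviations.
synonyms = [
    ("a/c", "air conditioner"),
    ("ac", "air conditioner"),
    ("ac", "alternating current"),   # inconsistent: a second parent
    ("hdlnr", "headliner"),
]

def multi_parent_terms(pairs):
    """Return every term that is mapped to more than one parent,
    with its conflicting parents, for presentation to the operator."""
    parents = {}
    for term, parent in pairs:
        parents.setdefault(term, set()).add(parent)
    return {t: sorted(ps) for t, ps in parents.items() if len(ps) > 1}

print(multi_parent_terms(synonyms))
# {'ac': ['air conditioner', 'alternating current']}
```

Because the lists change with every new model, a check like this has to run routinely rather than once, which is why the paper automates it.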
Simplifying the process of deploying the score code was critical, since nearly 100 clustering models would be put into production.

MONITORING

Building clustering models was a means to an end: to monitor incoming text data. The monitoring process was tasked with finding potential problems and alerting auto analysts about them. The initial set of monitoring processes and reports is described below.

CHANGE-IN-SIZE ALERTS

The change-in-size alert monitors cluster sizes on a weekly basis and signals abnormal growth by utilizing the p-charting capability of SAS/QC's PROC SHEWHART. In this case, the proportion of total records in a specific cluster took the place of the traditional proportion nonconforming, and the week's total number of records became the varying sample total. PROC SHEWHART calculates the appropriate control limits (3 sigma) from the data, depending on the variance of the clusters' proportions and the total sample size. The procedure also saves calculated limits and reads them back in during subsequent weeks. Analysts are alerted about any cluster for which one or more weeks saw the cluster's proportion of total records above the upper control limit. Figure 3 contains some sample output. The name of the cluster has been changed to "bumper" to match the hypothetical example above. In addition, if five or more periods in a row are trending in the same direction, an alert is also generated. Although this is not the classic application of a p-chart, its capabilities have proven solid. One concern early on was that the proportion being charted was a percent of total, not a simple proportion conforming. The zero-sum nature of a percent of total may have been problematic, but the large number of clusters (>100) softened the impact of one cluster's growth reducing a constant cluster's proportion of total.

NEW WORDS ALERTS

The new words alert is just that: the words in the week's data that have not previously appeared in the document collection. This report was designed to mitigate a property of scoring with a clustering model based on textual data: only the words used to build the model can be used to relate new data back to the model. This is analogous to not being able to add a new parameter to a regression model without re-estimating the regression. The new word processing checks the synonym list and suppresses alerts where just another synonym of a known parent appeared.

Figure 3. Sample p-chart output for the "bumper" cluster

CHANGE-IN-SHAPE ALERTS

This alert was designed early in the project to mitigate non-detection of a change in size when clusters contain more than one concept (i.e. bumpers and headliners). In theory, a simultaneous reduction in the number of claims for bumpers and an increase in claims for headliners would cancel one another out. Although watching the size of the cluster over time would not detect the increase in headliner claims, it was felt that the distribution of types of documents within the cluster had changed. The report required the calculation of the cumulative Root Mean Square Standard Deviation (RMSSTD) of each cluster each week. By reviewing some empirical plots of RMSSTD from the first scored data, it was determined that either large increases or decreases in variance signaled a change in the data, meriting the attention of the respective auto analyst. Representing the change in the data that RMSSTD identified presented a new challenge. Figure 4 is a two-dimensional representation of what the change in shape might look like: the outside of the sphere represents the cluster boundary, and the masses inside represent the locations of documents. Although a change in the distribution of terms within a cluster caused the changing variance, trying to count and track the terms over time to explain the change was not attempted. Instead, frequency lists of the part numbers for the records were produced and a comparison was made between the current and previous periods. If the only change was the language used to describe a problem, an auto analyst could deem the alert a false positive. The decision discussed above -- to make more homogeneous clusters by increasing the number of clusters -- diminished the need for monitoring the cluster variance. However, change in variance may still prove useful for particularly large clusters or clusters where homogeneity is in question.
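RMSSTD for a cluster is the pooled standard deviation of its member documents across all input dimensions. A sketch on synthetic SVD vectors, assuming the common pooled-sum-of-squares definition (the paper does not spell out its exact formula):

```python
import numpy as np

def rmsstd(X):
    """Root Mean Square Standard Deviation of one cluster: the pooled
    standard deviation of its members across all dimensions."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    ss = ((X - X.mean(axis=0)) ** 2).sum()   # within-cluster sum of squares
    return np.sqrt(ss / (d * (n - 1)))

# Synthetic "same cluster, two weeks": week 6 has a wider spread of
# documents, as when a second concept starts growing inside the cluster.
rng = np.random.default_rng(1)
week5 = rng.normal(0.0, 1.0, size=(200, 10))   # documents as SVD vectors
week6 = rng.normal(0.0, 2.0, size=(200, 10))
```

A large jump in this statistic between weeks, in either direction, is what triggers the change-in-shape review even when the cluster's size is stable.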

Figure 4. Change in cluster shape between Week 5 and Week 6

CONCLUSION

The initial results of this methodology of deploying SAS/Text Miner and SAS/QC have been very positive. Systematic analysis has been enabled for text data that was too large to attempt to read. Working directly with the domain experts has increased both the usefulness of the models and their acceptance within the company. The process continues to evolve: further automation has reduced the time required to bring a new clustering model into production, and new ideas on increasing the homogeneity of clusters will be tested.

REFERENCES

SAS Institute Inc., SAS/STAT User's Guide, Version 8, Volume 1, Cary, NC: SAS Institute Inc., 1999.
SAS Institute Inc., Getting Started with SAS Text Miner, Cary, NC: SAS Institute Inc., 2002.

ACKNOWLEDGMENTS

The authors would like to recognize SAS Professional Services, the prime contractor on this project. A special thanks also goes to those who reviewed this paper: Benjamin Scott (UC Berkeley/Business Researchers), Scott Carl (Tricision, Inc.), and Philip Corrin (Business Researchers).

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

John Wallace, Principal
Business Researchers, Inc.
74 Mallorca Way
San Francisco, CA 94123
Work Phone: 415-377-4759
Email: jwallace@businessresearchers.com
Web: http://www.businessresearchers.com

Tracy Cermack
American Honda Motor Co., Inc.
1919 Torrance Blvd., 500-2S-1B
Torrance, CA 90501-2746
Phone: 310-783-3515
Email: Tracy_Cermack@ahm.honda.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.