CHAID Decision Tree: Reverse Mortgage Loan Termination Example

Business Context

A Reverse Mortgage Loan (RML) enables senior citizens to receive periodic payments from a lender against a mortgage on their house, supplementing their income while they remain the owner and occupant of the house. Interest on the payments received accumulates. One type of reverse mortgage is the Home Equity Conversion Mortgage (HECM), which is insured by the Federal Housing Administration (FHA) and constitutes over 90% of all reverse mortgage loans originated in the U.S. market [2]. A HECM loan is terminated when the borrower dies or permanently moves out of the house. Understanding the termination outcomes of HECM loans is essential for the FHA insurance program and the long-term viability of the HECM program [2].

The data is downloaded from the HUD.GOV website [3]. All loans originated in 2003 and 2004 are considered for the example below. If the termination date is populated, the HECM loan is considered closed (terminated). The Age and Gender of borrowers are used in the example below to illustrate the decision tree building process.

CHAID Algorithm [1]

Process Steps

Step 1: Find the best split for each predictor (independent) variable by merging categories of the predictor variable.
Step 2: Compare the predictor variables, select the best variable for the node, and split the node into two child nodes.
Step 3: Repeat Step 1 and Step 2 for each of the new nodes until the stopping criteria are satisfied.

Step 1: Best Split for a Predictor Variable

In this example, the Age of borrowers is considered as one of the predictor (independent) variables. Age is an ordinal variable, hence only contiguous (adjacent) categories may be merged. Ordinal variables without a missing category are referred to as Monotonic Predictors. An ordinal variable with a missing category is considered a Floating Predictor; in a floating predictor, the floating (missing) category can be grouped with any other category.
Since the Age of the borrower does not have a missing category, it is treated as a Monotonic Predictor, and each category is compared with, and merged into, its adjacent category only.

RamG Data Analytics and Insights Page 1
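The distinction between predictor types comes down to which pairs of categories are allowed to merge. A minimal Python sketch of that rule follows; the function name `candidate_pairs` and the type labels "monotonic", "free" and "floating" are illustrative assumptions, not names taken from the original paper or from any library.

```python
from itertools import combinations


def candidate_pairs(categories, predictor_type, floating=None):
    """Return the pairs of categories that are allowed to merge.

    categories: category labels in their natural order.
    predictor_type: "monotonic", "free" or "floating" (hypothetical labels).
    floating: label of the floating (missing) category, if any.
    """
    if predictor_type == "free":
        # Nominal variable: any two categories may be merged.
        return list(combinations(categories, 2))
    if predictor_type == "monotonic":
        # Ordinal variable: only neighbours in the ordering may be merged.
        return list(zip(categories, categories[1:]))
    if predictor_type == "floating":
        # Ordered categories merge with their neighbours; the floating
        # category may additionally merge with any of them.
        ordered = [c for c in categories if c != floating]
        pairs = list(zip(ordered, ordered[1:]))
        pairs += [(c, floating) for c in ordered]
        return pairs
    raise ValueError(f"unknown predictor type: {predictor_type}")


# Borrower age groups are ordered and have no missing category.
print(candidate_pairs(["Low-65", "<=70", "<=75", "<=80"], "monotonic"))
```

For a monotonic predictor with k categories this yields k - 1 candidate pairs, while a free predictor yields k(k - 1)/2.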
1. Calculate the Chi-Square statistic between each pair of adjacent categories of the predictor variable, Age of the borrower. Each distinct age value would be treated as a separate category, but for simplicity the category groups below are used.

Borrower Age
Loan Terminate   Low-65    <=70    <=75    <=80    <=85    <=90    <=95   95-High
No                 5100    7709    7922    5676    2306     531      75        19
Yes                3512    5741    7260    7578    4679    1918     776       200
Total              8612   13450   15182   13254    6985    2449     851       219

Chi-Square statistic between adjacent categories:

Low-65 vs <=70       7.8
<=70 vs <=75        75.88
<=75 vs <=80       248.2
<=80 vs <=85       184.4
<=85 vs <=90       110.7
<=90 vs <=95        69.77
<=95 vs 95-High      0.004

2. Merge the categories that are least significantly different. Among the categories above, <=95 and 95-High are the least significantly different (Chi-Square statistic 0.004) and are the candidates for merging. The table after merging is shown below.

Borrower Age
Loan Terminate   Low-65    <=70    <=75    <=80    <=85    <=90   90-High
No                 5100    7709    7922    5676    2306     531        94
Yes                3512    5741    7260    7578    4679    1918       976
Total              8612   13450   15182   13254    6985    2449      1070

The Chi-Square statistics between adjacent categories are then recalculated to find the next least significantly different pair of categories.

3. Continue the category merging steps until two categories are left. The process also involves re-splitting a merged group that contains more than two categories.

4. Final split for the variable, Age of borrower:

Borrower Age
Loan Terminate    <=69     >69
No               11187   18151
Yes               8012   23652
Total            19199   41803
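The adjacent-category comparison in items 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the author's implementation: the helper `chi_square` is a hypothetical name, and it computes the plain Pearson statistic without a continuity correction, which appears to be what the figures in the text use. The counts are the grouped borrower-age counts from the first table.

```python
def chi_square(table):
    """Pearson chi-square statistic (no continuity correction) for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


labels = ["Low-65", "<=70", "<=75", "<=80", "<=85", "<=90", "<=95", "95-High"]
no = [5100, 7709, 7922, 5676, 2306, 531, 75, 19]
yes = [3512, 5741, 7260, 7578, 4679, 1918, 776, 200]

# Chi-square statistic for each pair of adjacent age groups.
adjacent_stats = [
    chi_square([[no[k], no[k + 1]], [yes[k], yes[k + 1]]])
    for k in range(len(labels) - 1)
]
for k, stat in enumerate(adjacent_stats):
    print(f"{labels[k]} vs {labels[k + 1]}: {stat:.3f}")

# The pair with the smallest statistic is the merge candidate.
k = min(range(len(adjacent_stats)), key=adjacent_stats.__getitem__)
print("merge:", labels[k], "and", labels[k + 1])
```

Run on these counts, the smallest statistic belongs to the <=95 and 95-High pair, which is the pair the text merges first. Repeating the same comparison on the merged table gives the next candidate.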
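Expected counts and Chi-Square contributions for the final Age split:

                 Expected counts      Chi-Square contributions
Loan Terminate    <=69      >69        <=69      >69
No                9233    20105         413      190
Yes               9966    21698         383      176

Chi-Square Statistic: 1161.95

Follow similar steps for the other predictor variables. Consider the Gender of the borrower. Gender is a nominal variable; hence any category can be merged with any other category. In the original paper [1], nominal variables are referred to as Free Predictors.

Loan Terminate   Couple   Female    Male   Not Reported   Total
No                12397    13349    3543             49   29338
Yes                9895    16003    5645            121   31664
Total             22292    29352    9188            170   61002

First Merge Iteration

The Chi-Square statistic is calculated for each pair of categories in the table above. The expected counts and Chi-Square contributions for each pair are listed below.

Couple vs Female
                 Expected counts      Chi-Square contributions
Loan Terminate   Couple   Female      Couple   Female
No                11113    14633         148      113
Yes               11179    14719         147      112
Chi-Square Statistic: 520

Couple vs Male
                 Expected counts        Chi-Square contributions
Loan Terminate   Couple       Male      Couple     Male
No                11288   4652.374         109      265
Yes               11004   4535.626         112      271
Chi-Square Statistic: 757

Couple vs Not Reported
                 Expected counts            Chi-Square contributions
Loan Terminate   Couple   Not Reported      Couple   Not Reported
No                12352             94           0             22
Yes                9940             76           0             27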
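Chi-Square Statistic (Couple vs Not Reported): 49

Female vs Male
                 Expected counts      Chi-Square contributions
Loan Terminate   Female     Male      Female     Male
No                12865     4027          18       58
Yes               16487     5161          14       45
Chi-Square Statistic: 136

Female vs Not Reported
                 Expected counts            Chi-Square contributions
Loan Terminate   Female   Not Reported      Female   Not Reported
No                13321             77       0.059             10
Yes               16031             93       0.049              9
Chi-Square Statistic: 19

Male vs Not Reported
                 Expected counts            Chi-Square contributions
Loan Terminate     Male   Not Reported        Male   Not Reported
No               3526.7           65.3         0.1            4.0
Yes              5661.3          104.7         0.0            2.5
Chi-Square Statistic: 6.7

Male and Not Reported have the smallest Chi-Square statistic (6.7), so they are merged first.

Second Merge Iteration

Loan Terminate   Couple   Female   Male & Not Reported   Total
No                12397    13349                  3592   29338
Yes                9895    16003                  5766   31664
Total             22292    29352                  9358   61002

Couple vs Female
                 Expected counts      Chi-Square contributions
Loan Terminate   Couple   Female      Couple   Female
No                11113    14633         148      113
Yes               11179    14719         147      112
Chi-Square Statistic: 520

Couple vs Male & Not Reported
                 Expected counts                   Chi-Square contributions
Loan Terminate   Couple   Male & Not Reported      Couple   Male & Not Reported
No                11262                  4727         114                   273
Yes               11030                  4631         117                   278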
Chi-Square Statistic (Couple vs Male & Not Reported): 783

Female vs Male & Not Reported
                 Expected counts                   Chi-Square contributions
Loan Terminate   Female   Male & Not Reported      Female   Male & Not Reported
No                12846                  4095          20                    62
Yes               16506                  5263          15                    48
Chi-Square Statistic: 145

Female and Male & Not Reported have the smallest Chi-Square statistic (145), so they are merged, leaving two categories.

Final split for the predictor variable, Gender of borrower:

Loan Terminate   Couple   Male, Female & Not Reported   Total
No                12397                         16941   29338
Yes                9895                         21769   31664
Total             22292                         38710   61002

                 Expected counts                           Chi-Square contributions
Loan Terminate   Couple   Male, Female & Not Reported      Couple   Male, Female & Not Reported
No                10721                         18617         262                           151
Yes               11571                         20093         243                           140
Chi-Square Statistic: 795

Step 2: Selecting Best Predictor Variable for Node Split

The most discriminating variable, based on the Chi-Square statistic, is selected to split the parent node into child nodes. In the example above, if Age and Gender of borrowers are the only two predictor variables, then based on the Chi-Square statistics (Age 1162, Gender 795), Age is selected as the variable to split the parent node.

Main Data
  Yes   31664   52%
  No    29338   48%

Split on Age of Borrower

  Age <= 69                 Age > 69
  Yes    8012   42%         Yes   23652   57%
  No    11187   58%         No    18151   43%
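The merging of a free (nominal) predictor and the comparison in Step 2 can be sketched together in Python. This is a simplified illustration under stated assumptions, not a full CHAID implementation: it follows the text in comparing raw Chi-Square statistics, it takes the final age split as given, and the names `chi_square`, `merge_free_predictor` and `split_statistic` are hypothetical. The nominal predictor is the one with categories Couple, Female, Male and Not Reported, labelled Gender in the code.

```python
from itertools import combinations


def chi_square(table):
    """Pearson chi-square statistic (no continuity correction) for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def pair_statistic(groups, a, b):
    """Chi-square statistic of the 2x2 table formed by categories a and b."""
    return chi_square([[groups[a][0], groups[b][0]],
                       [groups[a][1], groups[b][1]]])


def merge_free_predictor(counts):
    """Greedily merge the two least different categories until two remain.

    counts: dict mapping category label -> (no_count, yes_count).
    """
    groups = dict(counts)
    while len(groups) > 2:
        # Free predictor: every pair of categories is a merge candidate.
        a, b = min(combinations(groups, 2),
                   key=lambda pair: pair_statistic(groups, *pair))
        no_a, yes_a = groups.pop(a)
        no_b, yes_b = groups.pop(b)
        groups[f"{a} & {b}"] = (no_a + no_b, yes_a + yes_b)
    return groups


def split_statistic(split):
    """Chi-square statistic of a final split given as label -> (no, yes)."""
    nos, yeses = zip(*split.values())
    return chi_square([list(nos), list(yeses)])


gender = {
    "Couple": (12397, 9895),
    "Female": (13349, 16003),
    "Male": (3543, 5645),
    "Not Reported": (49, 121),
}
gender_split = merge_free_predictor(gender)

# Final age split, copied from the text rather than recomputed here.
age_split = {"<=69": (11187, 8012), ">69": (18151, 23652)}

stats = {"Age": split_statistic(age_split),
         "Gender": split_statistic(gender_split)}
best = max(stats, key=stats.get)
print({name: round(value, 1) for name, value in stats.items()})
print("split on:", best)
```

On these counts the greedy merge first joins Male with Not Reported and then joins Female with that group, leaving Couple against the rest, and Age has the larger statistic.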
References

1. G. V. Kass, An exploratory technique for investigating large quantities of categorical data, Applied Statistics
2. Tonja Bowen Bishop, Hui Shan, Reverse Mortgages: A Closer Look at HECM Loans
3. http://portal.hud.gov