Modifying Insurance Rating Territories Via Clustering
Modifying Insurance Rating Territories Via Clustering
Quncai Zou, New Jersey Manufacturers Insurance Company, West Trenton, NJ
Ryan Diehl, New Jersey Manufacturers Insurance Company, West Trenton, NJ

ABSTRACT
In the United States, most property and casualty lines of insurance use territories to incorporate geographical risk into their rating structures. The geographical rating territories are, in theory, based on homogeneous groupings of geographical areas. Based on the expected level of loss per exposure for each grouping, a rate can be determined. For many insurance companies, territory relativities are updated frequently because of loss experience or competitive market forces; however, territory boundaries often have not changed significantly over the years. This is a problem of cluster analysis. Several procedures in SAS can be used to create a more homogeneous territory definition that is more commensurate with the company's own experience. However, these clustering procedures do not naturally produce contiguous groupings (i.e., groupings that only include geographic units that are adjacent to each other). Since contiguous territorial boundaries are desired, a contiguity constraint needs to be added to the clustering routine. There are no readily available procedures in SAS that can conveniently apply the contiguity constraint. Therefore the authors use SAS macros to combine the principle of locality and Ward's method, with the aim of modifying the current territories. Comparison of the homogeneity of the new and old territories indicates that homogeneity is significantly improved.

INTRODUCTION
In the United States, most property and casualty lines of insurance use territories to incorporate geographical risk into their rating structures. Insurers charge varied premiums based on the rating territories. The geographical rating territories are, in theory, based on homogeneous groupings of geographical areas. Companies typically define territories as a collection of small geographic units (e.g., zip codes, counties, or census blocks) with similar expected loss costs. Territories are necessary to differentiate individual risks on top of other rating variables, since geography is considered one of the primary drivers of claims experience. A rate can be determined for each grouping based on expected losses per exposure (a.k.a. pure premium).

For many insurance companies, territory relativities are updated frequently to reflect recent loss experience or competitive market forces; however, traditional territory definitions are not changed significantly over the years. This is due to the complexity of determining new territorial definitions, as well as the rate dislocation that entirely new rate boundaries may cause existing policyholders. Because of the changing environment (new housing development, changes in commuting patterns, demographic changes, etc.), boundaries which may have been relevant decades ago may have lost their meaning over time. Therefore it is necessary to modify the current territory definitions so as to appropriately reflect the geographical risk. One method for the modification is to overhaul the current territory classification using historical loss experience by applying various cluster analysis techniques. The SAS procedure PROC CLUSTER is readily available to achieve this, and promises to create more homogeneous territories commensurate with the company's own experience.
It is important to note, however, that these types of clustering routines do not naturally produce contiguous groupings (i.e., groupings that only include geographic units that are adjacent to each other). Since contiguous territorial boundaries are desired here, a contiguity constraint needs to be added to the clustering routine. There are no readily available procedures in SAS that can conveniently apply the contiguity constraint. In this paper, the authors revisit the current territory definitions with SAS macro programming. The principle of locality is incorporated in the macros to create a more credible pure premium for each geographical unit. The cluster analysis carried out in this paper uses Ward's minimum variance method, which is described below.
DEFINITIONS
Several definitions are necessary before going forward. These are standard terms used in property and casualty ("P&C") insurance, and are provided for non-insurance professionals.

Exposure - the basic unit of risk that underlies the insurance premium. For example, one house insured for one year represents one exposure for homeowners insurance. For this paper, the amount of insurance purchased is used as the exposure.
Rate - the expected cost per unit of exposure. The rate must be sufficient to cover losses, expenses, and a profit and contingency provision.
Premium - the final cost of the policy, which is equal to the product of exposure units and rate.
Loss cost or pure premium - the expected amount of losses per exposure unit; the goal of territorial ratemaking is to establish territorial rates that are closely correlated to expected loss costs.

DATA AND METHODOLOGY

GEOGRAPHICAL UNIT
Before any data preparation, the first step is to determine the geographical units. The unit should be refined enough to be relatively homogeneous with respect to geographic differences while still having some observations in most units. The unit should also be easily accessible. Typical units are zip codes, census blocks, counties, or some combination of these. Each of these options has practical advantages and disadvantages. For example, while zip codes have the advantage of being the most readily available, they have the disadvantage of changing over time. Counties have the advantage of being static and readily available; however, due to the large size of most counties, they tend to contain very heterogeneous risks. Census blocks are relatively static over time, but require a process to map insurance policies to the census blocks. Postal zip codes are often used as the building blocks for territory definition, mainly due to their availability.

For the example presented in this paper, five years of homeowners accident year data at the zip code level for New Jersey, including exposures and pure premium, were obtained. Average annual pure premiums excluding the highest and lowest years were calculated for each zip code to remove possible outliers in the data.

Table 1 - Typical source dataset (columns: index, Code, X, Y, PPAOI_MID3, AOI_Exposures, Neighbors, n1, n2, ..., n16; sample values omitted)

A typical dataset used in the example is shown in Table 1. There are five principal variables: code (zip code), x (longitude), y (latitude), PPAOI_MID3 (pure premium), and AOI_Exposures (exposure, adjusted for amount of insurance). In addition, to apply the contiguity constraint, the corresponding neighboring information is also retained for each zip code: index is a unique number assigned to each zip code; Neighbors is the number of neighboring zip codes for the corresponding zip code; and n1, n2, ..., n16 are the unique indexes of the neighboring zip codes. For example, the zip code with index = 1 has 7 neighbors, whose indexes are 7 (zip code 07008), 32 (zip code 07036, not shown in the table), and so on. An adjacency table, which contains every pair of zip codes that are in contact, was used for the neighboring zip code information.
PRINCIPLE OF LOCALITY
The data can be thin for some of the zip codes shown in Table 1. Generally, the smaller the exposure for a zip code, the less credible it becomes. According to the principle of locality, the expected loss experience at a given location is similar to the loss experience nearest to that location. One may use this idea to improve the estimate for any individual unit by using information from nearby units. This is sometimes referred to as spatial smoothing. The creation of a credibility weighted pure premium for each zip code is one way of utilizing spatial smoothing techniques.

The procedure starts out with a pure premium for each zip code. Then, for each zip code, the latitude and longitude of the zip centroids are used to determine the groups of zip codes whose centroids are within a 5-, 10-, 15-, 20-, 25-, and 50-mile radius of that zip code. A pure premium was calculated for each of the six groups. The statewide average pure premium was also calculated.

The next step is to assign credibility to the zip code's own pure premium and to each of the six groupings associated with that zip code. The credibility value was calculated using earned premium and the formula z = P / (P + K), where z is the credibility assigned, P is the earned premium, and K is the credibility constant of $2,500,000. For the 5-mile radius grouping, the credibility assigned to the zip code was subtracted out to get the credibility assigned to this grouping's pure premium. For example, if the zip code's credibility is 0.60 and the formula credibility of the zip codes within 5 miles is 0.80, then 0.20 credibility is assigned to the 5-mile radius grouping. For the 10-mile radius grouping, the credibilities previously assigned to the zip code and to the 5-mile radius grouping were subtracted from the formula credibility for the 10-mile grouping, and so on for the larger radii. If the sum of the assigned credibilities did not reach 100%, the remaining credibility was assigned to the statewide average pure premium [1].

The credibility weighted average pure premium can then be calculated as:

PP_wtd = z_0*PP_0 + z_5*PP_5 + z_10*PP_10 + z_15*PP_15 + z_20*PP_20 + z_25*PP_25 + z_50*PP_50 + z_overall*PP_overall

where PP_wtd is the credibility weighted pure premium, z_x (x = 0, 5, 10, 15, 20, 25, 50, overall) is the credibility assigned to the corresponding radius group, and PP_x is the pure premium for that group (PP_0 being the zip code's own pure premium and PP_overall the statewide average).

The credibility formula discussed above is just one among many choices for credibility and the complement of credibility. This method is designed to pick up information from the geographical areas surrounding each zip code. For most zip codes in this study, almost all the credibility was assigned within a 10-mile radius.
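The short DATA step below sketches the credibility formula just described. It is only an illustration: the dataset ZIP_PREM and the variable EARNED_PREMIUM are hypothetical names, not part of the program in the appendices.

/* Sketch of the credibility formula z = P / (P + K) with K = $2,500,000.       */
/* ZIP_PREM and EARNED_PREMIUM are assumed (illustrative) names.                */
data zip_cred;
   set zip_prem;
   K = 2500000;                                /* credibility constant from the text   */
   z = earned_premium / (earned_premium + K);  /* credibility assigned to the zip code */
run;

The incremental credibility for each radius group is then that group's formula credibility minus the credibility already assigned to the smaller groups, as in the 0.80 - 0.60 = 0.20 example above.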
METHODOLOGY
Several general types of cluster analysis methods exist. For each of these general types there are a number of specific methods, and most of these methods can use a wide array of similarity or dissimilarity measures. As in any cluster analysis problem, many different ways may be used to create the clusters. In any case, an appropriate similarity measure must be chosen first. A similarity measure is a way to tell how similar two units, or two groups of units, are to each other.

Euclidean distance is the most often used similarity measure between two units. To measure how close two clusters (groups of units) are to each other, the most common choices are the centroid method (the distance between the mean vector locations of the two clusters), single linkage (the distance between the closest members of the two clusters), complete linkage (the distance between the farthest-apart members), and average linkage (the average of the distances between all pairs of members, one from each cluster).
Once the measure of association and the method for determining the distances between clusters have been chosen, one generally has two ways to proceed: agglomerative or divisive clustering. For agglomerative clustering, one starts out with all sample units in N clusters of size 1; then, at each step of the algorithm, the pair of clusters with the highest similarity is combined into a single cluster. The algorithm stops when all sample units are combined into a single cluster of size N. For divisive clustering, one starts out with all sample units in a single cluster of size N. Then, at each step of the algorithm, a cluster is partitioned into a pair of sub-clusters, selected to minimize the similarity between the sub-clusters. The algorithm stops when the sample units are partitioned into N clusters of size 1.

Another approach to cluster analysis, and the one used in this paper, is Ward's method. Basically, it treats cluster analysis as an analysis of variance problem, instead of using distance metrics or measures of association. It is an agglomerative clustering algorithm: it starts out at the leaves and works its way to the trunk, looking for groups of leaves that it forms into branches, the branches into limbs, and eventually into the trunk. Ward's method starts out with N clusters of size 1 and continues until all the observations are included in one cluster. This method is most appropriate for quantitative variables. For this particular method, three quantities need to be defined, just as in ANOVA:

Error Sum of Squares: ESS = sum over clusters i and units j of (X_ij - Xbar_i)^2, where Xbar_i is the mean of cluster i. Here we are summing the squared differences between each unit and its cluster mean over all of the units within each cluster; we are essentially comparing the individual observations against their cluster means. A small ESS suggests that the data are close to their cluster means, implying that the units in each cluster have high similarity.

Between Sum of Squares: BSS = sum over clusters i and units j of (Xbar_i - Xbar)^2, where Xbar is the overall mean. Here we are comparing the cluster means with the overall mean.

Total Sum of Squares: TSS = sum over clusters i and units j of (X_ij - Xbar)^2. TSS measures the total variation present in the data and is the sum of ESS and BSS.

The homogeneity measure chosen in this paper is the percentage of the total variance that is within-cluster variance (%Variance), defined as the error sum of squares divided by the total sum of squares. If the %Variance is small, the territories have high internal similarity. Let's consider two extremes:

Each individual zip code is its own cluster; i.e., if there are N zips, there are N clusters. The ESS would be 0, since each individual zip-level value equals its cluster mean. So the homogeneity measure would be 0. This is the most homogeneous case.

All zips are combined into one cluster. The ESS would equal the TSS according to the definitions, so the homogeneity measure would be 1. This is the least homogeneous situation.

Therefore the percent variance is able to reasonably measure the homogeneity of the territory definition.
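As an illustration of this homogeneity measure, the sketch below computes the %Variance for a given territory assignment with an exposure-weighted one-way ANOVA, much as the main macro in Appendix D does with PROC GLM. The dataset TERR_DATA and the variables TERRITORY, PURE_PREMIUM, and EXPOSURE are placeholders.

/* %Variance = Error SS / Corrected Total SS for an existing territory assignment.   */
/* TERR_DATA, TERRITORY, PURE_PREMIUM and EXPOSURE are assumed (illustrative) names. */
proc glm data=terr_data;
   class territory;
   model pure_premium = territory;   /* within-territory variation goes to Error */
   weight exposure;                  /* exposure-weighted sums of squares        */
   ods output OverallANOVA = anova_ss(keep=Source DF SS);
run;
quit;

proc sql;
   select 100 * sum(SS * (Source = "Error")) / sum(SS * (Source = "Corrected Total"))
          as pct_within_variance
   from anova_ss;
quit;

Appendix D wraps this same computation in the clustering loop, so the %Variance can be tracked as the number of territories changes.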
Using Ward's method, we start out with all zip codes in N clusters of size 1 each. In the first step of the algorithm, N - 1 clusters are formed, one of size two and the remaining of size 1, and the ESS is computed. The pair of zip codes that yields the smallest ESS, and thus the smallest %Variance, forms the first cluster. Then, in the second step of the algorithm, N - 2 clusters are formed from the N - 1 clusters defined in the first step. These may include two clusters of size 2, or a single cluster of size 3 that includes the two zip codes clustered in step 1. Again, the value of ESS is minimized. Thus, at each step of the algorithm, clusters or observations are combined in such a way as to minimize the resulting ESS. The algorithm stops when all zip codes are combined into a single large cluster of size N.
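For comparison, the unconstrained version of this clustering, with no contiguity requirement, can be run directly with PROC CLUSTER, as noted in the introduction. The sketch below assumes a dataset HO_TERRITORY_TMP containing the credibility weighted pure premium pp_wtd and the zip code in the variable code, and it ignores the exposure weighting that the macros in the appendices apply.

/* Unconstrained Ward clustering of zip codes on the credibility weighted pure premium. */
/* Dataset and variable names are assumed; no contiguity constraint is applied here.    */
proc cluster data=ho_territory_tmp method=ward outtree=tree noprint;
   var pp_wtd;    /* credibility weighted pure premium */
   id code;       /* carry the zip code into the tree  */
run;

/* Cut the dendrogram at the desired number of territories, e.g. 32 */
proc tree data=tree nclusters=32 out=zip_clusters noprint;
   id code;
run;

Because nothing in this run ties the clusters to the map, the resulting groupings are generally not contiguous, which is why the macros in the appendices restrict each merge to pairs of adjacent zip codes.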
APPLICATION AND RESULTS

APPLICATION
Most of the work is done in SAS via SAS macro programming. The SAS macro language is a very versatile and useful tool. It is especially useful if the same code will be run multiple times, and it facilitates passing information from one procedure to another. Furthermore, it can be used to write SAS programs that are dynamic and flexible. Using nested macros, one can better organize complicated SAS programs into macro modules.

Two main tasks need to be done in this application: first, calculate the credibility weighted pure premiums; second, apply Ward's method to the weighted pure premiums to create contiguous groupings of zip codes. Four macro functions (shown in the Appendix section) were written to do the job. The calling sequence of these macros is shown in Graph 1. The macro %distance() in Appendix A is called by %credibility_weighted_premium() in Appendix B to create the credibility weighted pure premium for each zip code. The macro %ward_diff() in Appendix C calculates the Ward difference between two clusters. The macros in Appendices B and C are both called in the main macro %ho_territory() (Appendix D) to create homogeneous groupings of zip codes based on Ward's method.

%ho_territory()                        (Appendix D)
    %credibility_weighted_premium()    (Appendix B)
        %distance()                    (Appendix A)
    %ward_diff()                       (Appendix C)

Graph 1 - Calling sequence of the macro functions

To calculate the credibility weighted pure premiums, the first step is to determine the distance between zip codes. As discussed in the methodology section, the distance between any individual zip code and the remaining zips needs to be calculated to define the six radius groups. A macro function, %distance, is defined in Appendix A; it requires five parameters. long1 and lat1 are the longitude and latitude of the centroid of the first zip, long2 and lat2 are the longitude and latitude of the centroid of the second zip, and R is the radius of the earth. By calling this macro function with the required parameters, one can calculate the distance between any two points on the earth.

The macro function defined in Appendix B, %credibility_weighted_premium, produces the credibility weighted pure premium for each zip code. This macro function calls the macro in Appendix A and takes two parameters, indata and outdata. The typical source data shown in the Data and Methodology section can be used as the input for indata; outdata can be any valid SAS dataset name. The output from calling %credibility_weighted_premium is a SAS dataset similar to the one shown in Table 1, with one more column added: pp_wtd (the credibility weighted pure premium). The value of this variable for each zip code, along with the exposure and unique index, is saved in arrays of macro variables (resp1-respN for the credibility weighted pure premium, weight1-weightN for the exposure, and ind1-indN for the unique index). These values are referred to in subsequent macro calls.

In Appendix C, another macro function is defined to calculate the Ward difference between two units. The code in this macro is used many times in the main macro %ho_territory (called in do loops). By separating the code into a macro function, one can not only reuse the same code multiple times, but also make the program better organized and easier to read. This macro function has four input parameters: n1, n2, n_index1, and n_index2. n1 (n2) is a character string holding the indexes of the zips in the first (second) cluster, and n_index1 (n_index2) is the number of zips in the first (second) cluster.

The main macro function %ho_territory is defined in Appendix D. It has three parameters: indata, n_terr, and limit. indata is the same as that used in the macro in Appendix B. n_terr is the desired number of groupings to create; for example, to create 30 territories, provide 30 for n_terr. An extra parameter, limit, offers the ability to apply a constraint (a maximum amount of loss exposure in each grouping) when creating the groupings; one might find this useful if by chance one grouping grows too big.
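A hypothetical invocation is sketched below. The appendices also reference the global macro variables &y (the pure premium variable), &w (the exposure variable), and &y_wtd (the name of the credibility weighted variable to create) without showing where they are set, so the %LET statements and the limit value here are illustrative assumptions rather than part of the original program.

/* Illustrative setup; variable names follow Table 1, and the exposure cap is arbitrary. */
%let y     = PPAOI_MID3;      /* pure premium variable                           */
%let w     = AOI_Exposures;   /* exposure variable                               */
%let y_wtd = pp_wtd;          /* credibility weighted pure premium to be created */

/* Build 32 contiguous territories from the zip-level dataset HO_TERRITORY,      */
/* limiting each grouping to at most 500,000 units of exposure (hypothetical).   */
%ho_territory(HO_TERRITORY, 32, 500000);

As the code in Appendix D is written, the run leaves the %Variance at each number of territories in the dataset SUM_SQUARES, and the final groupings, encoded as underscore-delimited strings of zip indexes, in HO_TERRITORY_1.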
RESULTS
This methodology was applied to New Jersey homeowners experience. The results of this particular study are shown in Graph 2. The graph depicts the results for the current territory structure (labeled "Current") and the modified territory structure (labeled "Modified") using the method described in this paper. Before the modification, there were 32 territories, for which the total %Variance is as high as 70%. From the graph, we see that we are able to reduce the within-cluster variance percentage to 25% while holding the number of territories constant at 32.

Graph 2 also depicts the %Variance for different numbers of territories. As expected, increasing the number of territories - and thus increasing the similarity within each territory - lowers the %Variance. The graph also helps the user decide how many territories should be created based on an acceptable %Variance. For example, if a company is willing to accept a %Variance of 10% for its territory structure for a certain line of business, then about 80 territories should be defined. The number of territories will often depend on implementation concerns, including systems limitations, policyholder disruption, and the competitive landscape.

Graph 2 - Comparison of %Variance for the current and modified territory definitions (New Jersey homeowners insurance pure premium; percentage of within variance by number of territories)

To avoid substantial rate disruption for all policyholders, one might use this program to revisit the current territories. The program can report the percentage of variance for each existing territory. A ranking of these percentages from high to low provides a useful guide for the analyst as to which territories need the most attention and should be modified first. The territories at the top of the list might need to be split, and the program can then be run on such a territory alone to suggest meaningful splits.
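One simple way to approximate such a ranking, assuming a zip-level dataset CURRENT_TERR that already carries the existing territory code, the pure premium, and the exposure (illustrative names, not part of the appendices), is to compute an exposure-weighted within-territory sum of squares for each territory:

/* Exposure-weighted within-territory (corrected) sum of squares per existing territory. */
/* CURRENT_TERR, TERRITORY, PURE_PREMIUM and EXPOSURE are assumed names.                 */
proc means data=current_terr noprint nway;
   class territory;
   var pure_premium;
   weight exposure;
   output out=terr_ss css=within_ss;
run;

/* Territories with the largest within sum of squares are candidates to split first */
proc sort data=terr_ss;
   by descending within_ss;
run;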
Graph 3 shows an example of a split suggested by the program for one of the existing territories.

Graph 3 - Example of a split of an existing territory suggested by the program

CONCLUSION AND DISCUSSION
This paper has presented an application of one technique to define geographical rating territories using SAS macros. To supplement geographical units with sparse data, the principle of locality was incorporated to create a credibility weighted pure premium for each unit. Ward's method is used to perform the cluster analysis. Detailed SAS code is provided in the Appendix.

The program included in this paper uses postal zip codes as the geographical units. This is not the only possibility, however; the program can handle other geographical units, such as counties, with little modification. Ward's method is used for the cluster analysis because it tends to create smaller clusters, which is a desired property here. Other clustering techniques can certainly be applied as well with appropriate adjustments.

The results shown in the paper suggest that the homogeneity measure (%Variance) can be significantly improved (from 70% to 25%). However, one should keep in mind that these results were obtained under an ideal condition, i.e., with no restrictions. In practice, consideration should be given to competitive concerns, rate disruption, and so on. The final proposed territory boundaries could differ from the ones resulting from this paper, which may lead to some increase in the homogeneity measure.

ACKNOWLEDGEMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:

Quncai Zou
Research Statistician
New Jersey Manufacturers Insurance Company
301 Sullivan Way
West Trenton, NJ
Work Phone: Ext

Ryan Diehl, FCAS, MAAA
Actuary
New Jersey Manufacturers Insurance Company
301 Sullivan Way
West Trenton, NJ
Work Phone: Ext

APPENDIX A - MACRO FOR CALCULATING THE DISTANCE BETWEEN THE CENTROIDS OF ANY TWO ZIP CODES

%macro distance(long1, lat1, long2, lat2, R);
   /* approximate distance in miles between two points given as degrees of  */
   /* longitude and latitude (the R parameter is not used in this formula)  */
   xx = 69.1 * (&lat2 - &lat1);
   yy = 69.1 * (&long2 - &long1) * cos(&lat1/57.3);
   distance = sqrt(xx * xx + yy * yy);
%mend distance;

APPENDIX B - MACRO FOR PRODUCING THE CREDIBILITY WEIGHTED PURE PREMIUMS FOR EACH ZIP CODE

%macro credibility_weighted_premium(indata, outdata);
data &outdata.;
   set &indata.;

   /* reset the radius-group accumulators for each zip code (the sum statement */
   /* otherwise retains its value across observations)                         */
   sum_loss_5  = 0; sum_exp_5  = 0;
   sum_loss_10 = 0; sum_exp_10 = 0;
   sum_loss_15 = 0; sum_exp_15 = 0;
   sum_loss_20 = 0; sum_exp_20 = 0;
   sum_loss_25 = 0; sum_exp_25 = 0;
   sum_loss_50 = 0; sum_exp_50 = 0;
   sum_loss_50plus = 0; sum_exp_50plus = 0;

   /* accumulate losses and exposures of every zip code falling in each radius group */
   %do i = 1 %to &n_zip.;
      %distance(x, y, &&long&i., &&lat&i., 6371);
      if distance <= 5 then do;
         sum_loss_5 + &&loss&i.;
         sum_exp_5 + &&exp&i.;
      end;
      if distance <= 10 then do;
         sum_loss_10 + &&loss&i.;
         sum_exp_10 + &&exp&i.;
      end;
      if distance <= 15 then do;
         sum_loss_15 + &&loss&i.;
         sum_exp_15 + &&exp&i.;
      end;
      if distance <= 20 then do;
         sum_loss_20 + &&loss&i.;
         sum_exp_20 + &&exp&i.;
      end;
      if distance <= 25 then do;
         sum_loss_25 + &&loss&i.;
         sum_exp_25 + &&exp&i.;
      end;
      if distance <= 50 then do;
         sum_loss_50 + &&loss&i.;
         sum_exp_50 + &&exp&i.;
      end;
      /* statewide totals */
      sum_loss_50plus + &&loss&i.;
      sum_exp_50plus + &&exp&i.;
   %end;

   /* pure premium of each radius group and the statewide average */
   pp5 = sum_loss_5 / sum_exp_5;
   pp10 = sum_loss_10 / sum_exp_10;
   pp15 = sum_loss_15 / sum_exp_15;
   pp20 = sum_loss_20 / sum_exp_20;
   pp25 = sum_loss_25 / sum_exp_25;
   pp50 = sum_loss_50 / sum_exp_50;
   average = sum_loss_50plus / sum_exp_50plus;

   /* incremental credibility of each radius group, using the credibility constant */
   /* K = 2,500,000 noted in the Principle of Locality section                     */
   Z0 = &y.*&w. / (&y.*&w. + 2500000);
   Z5 = sum_loss_5 / (sum_loss_5 + 2500000) - Z0;
   Z10 = sum_loss_10 / (sum_loss_10 + 2500000) - sum_loss_5 / (sum_loss_5 + 2500000);
   Z15 = sum_loss_15 / (sum_loss_15 + 2500000) - sum_loss_10 / (sum_loss_10 + 2500000);
   Z20 = sum_loss_20 / (sum_loss_20 + 2500000) - sum_loss_15 / (sum_loss_15 + 2500000);
   Z25 = sum_loss_25 / (sum_loss_25 + 2500000) - sum_loss_20 / (sum_loss_20 + 2500000);
   Z50 = sum_loss_50 / (sum_loss_50 + 2500000) - sum_loss_25 / (sum_loss_25 + 2500000);
   Z50plus = 1 - Z0 - Z5 - Z10 - Z15 - Z20 - Z25 - Z50;

   /* credibility weighted pure premium */
   &y_wtd = &y.*Z0 + pp5*Z5 + pp10*Z10 + pp15*Z15 + pp20*Z20 + pp25*Z25
            + pp50*Z50 + average*Z50plus;

   DROP Z: pp5 pp10 pp15 pp20 pp25 pp50 distance x y xx yy average sum_: ;
run;
%mend credibility_weighted_premium;

/* example call (requires the macro variables created in %ho_territory) */
%credibility_weighted_premium(HO_TERRITORY, HO_TERRITORY_tmp);

APPENDIX C - MACRO FOR CALCULATING THE WARD DIFFERENCE BETWEEN TWO CLUSTERS OF ZIP CODES

%macro ward_diff(n1, n2, n_index1, n_index2);
   sum_wtd = 0;
   sum_weight = 0;
   diff_ward = 0;

   /* accumulate the weighted pure premiums and weights of the zips in both clusters */
   do i = 1 to &n_index1.;
      if scan(&n1., i, "_") = "&ind1." then do;
         sum_wtd = sum_wtd + &resp1. * &weight1.;
         sum_weight = sum_weight + &weight1.;
      end;
      %do j = 2 %to &N_ZIP.;
      else if scan(&n1., i, "_") = "&&ind&j." then do;
         sum_wtd = sum_wtd + &&resp&j. * &&weight&j.;
         sum_weight = sum_weight + &&weight&j.;
      end;
      %end;
   end;
   do i = 1 to &n_index2.;
      if scan(&n2., i, "_") = "&ind1." then do;
         sum_wtd = sum_wtd + &resp1. * &weight1.;
         sum_weight = sum_weight + &weight1.;
      end;
      %do j = 2 %to &N_ZIP.;
      else if scan(&n2., i, "_") = "&&ind&j." then do;
         sum_wtd = sum_wtd + &&resp&j. * &&weight&j.;
         sum_weight = sum_weight + &&weight&j.;
      end;
      %end;
   end;

   /* weighted mean of the combined cluster and the weighted within sum of squares */
   mean_wtd = sum_wtd / sum_weight;
   do i = 1 to &n_index1.;
      if scan(&n1., i, "_") = "&ind1." then
         diff_ward = diff_ward + (&resp1. - mean_wtd)**2 * &weight1.;
      %do j = 2 %to &N_ZIP.;
      else if scan(&n1., i, "_") = "&&ind&j." then
         diff_ward = diff_ward + (&&resp&j. - mean_wtd)**2 * &&weight&j.;
      %end;
   end;
   do i = 1 to &n_index2.;
      if scan(&n2., i, "_") = "&ind1." then
         diff_ward = diff_ward + (&resp1. - mean_wtd)**2 * &weight1.;
      %do j = 2 %to &N_ZIP.;
      else if scan(&n2., i, "_") = "&&ind&j." then
         diff_ward = diff_ward + (&&resp&j. - mean_wtd)**2 * &&weight&j.;
      %end;
   end;
%mend ward_diff;

APPENDIX D - THE MAIN MACRO FOR CREATING CONTIGUOUS GROUPINGS OF ZIP CODES BASED ON THE CREDIBILITY WEIGHTED PURE PREMIUM OF EACH ZIP CODE

%macro ho_territory(indata, n_terr, limit);
%let flag_stop = ;
%let temp_data = HO_TERRITORY_1;

/* create the macro variables used by %credibility_weighted_premium() */
data _null_;
   set &indata. end = eof;
   call symput("code" || strip(_n_), code);
   call symput("long" || strip(_n_), x);
   call symput("lat" || strip(_n_), y);
   call symput("loss" || strip(_n_), &w.*&y.);
   call symput("exp" || strip(_n_), &w.);
   if eof then call symput("n_zip", _n_);
run;

/* call %credibility_weighted_premium() to create the credibility weighted pure premium */
%credibility_weighted_premium(&indata., HO_TERRITORY_tmp);

/* create the macro variables used when calculating the ward difference of two neighboring units */
data _null_;
   set HO_TERRITORY_tmp;
   call symput("resp" || strip(_n_), &y_wtd.);
   call symput("weight" || strip(_n_), &w.);
   call symput("ind" || strip(_n_), strip(index));
run;

/* create pairs of adjacent zip codes */
data HO_TERRITORY_tmp_1;
   set HO_TERRITORY_tmp(rename = (index = index1));
   array nn(*) n1 - n16;
   length index2 $ 2000;
   do i = 1 to dim(nn);
      if nn(i) ^= . then do;
         index2 = strip(nn(i));
         output;
      end;
   end;
   keep index1 index2 &y_wtd. &w.;
run;

proc sort data = HO_TERRITORY_tmp_1;
   by index2;
run;

proc sort data = HO_TERRITORY_tmp;
   by index;
run;

data HO_TERRITORY_paired;
   merge HO_TERRITORY_tmp_1(rename = (&y_wtd. = resp1 &w. = weight1))
         HO_TERRITORY_tmp(keep = index &y_wtd. &w.
                          rename = (&y_wtd. = resp2 index = index2 &w. = weight2));
   by index2;
   n_index1 = 1;
   n_index2 = 1;
   if resp2 > 0 and resp1 > 0;
   drop resp2;
run;

/* calculate the ward difference for each pair of adjacent zip codes */
data &temp_data.;
   set HO_TERRITORY_paired end = eof;
   %ward_diff(index1, index2, n_index1, n_index2);
   keep index1 index2 n_index1 n_index2 diff_ward weight1 weight2;
run;

/* save the number of territories in a macro variable: tmp_n_terr drives the %do loop below */
proc sql noprint;
   select count(distinct(index1)) into :tmp_n_terr from &temp_data.;
quit;

data sum_squares;
run;

%do %while(&tmp_n_terr. > &n_terr and &flag_stop. ne stop);

   proc sort data = &temp_data. out = tmp_&temp_data. nodup;
      by diff_ward index1;
   run;

   /* get the indexes of the two clusters with the minimum ward difference */
   data _null_;
      set tmp_&temp_data. end = eof;
      retain count 0 flag_first 0;
      length index_new $ 2000;
      if flag_first = 0 then do;
         if weight1 + weight2 <= &limit. then do;
            count + 1;
            if index1 < index2 then do;
               index_new = cats(strip(index1), "_", strip(index2));
               weight_new = weight1 + weight2;
            end;
            else do;
               index_new = cats(strip(index2), "_", strip(index1));
               weight_new = weight1 + weight2;
            end;
            call symput("index" || strip(count), strip(index1));
            call symput("index_new", strip(index_new));
            call symput("weight_new", weight_new);
            if count = 2 then flag_first = 1;
         end;
         if eof then do;
            /* no mergeable pair within the exposure limit: stop the %do %while loop */
            flag_stop = "stop";
            call symput("flag_stop", flag_stop);
            stop;
         end;
      end;
      else do;
         if count = 0 then do;
            flag_stop = "stop";
            call symput("flag_stop", flag_stop);
            stop;
         end;
      end;
   run;

   /* merge the selected pair and recalculate the ward difference for the affected pairs */
   data &temp_data.;
      set tmp_&temp_data.;
      if strip(index1) = strip("&index1.") or strip(index1) = strip("&index2.") then do;
         index1 = strip("&index_new.");
         n_index1 = count(strip(index1), "_") + 1;
         weight1 = &weight_new.;
         %ward_diff(index1, index2, n_index1, n_index2);
      end;
      if strip(index2) = strip("&index1.") or strip(index2) = strip("&index2.") then do;
         index2 = strip("&index_new.");
         n_index2 = count(strip(index2), "_") + 1;
         weight2 = &weight_new.;
         %ward_diff(index1, index2, n_index1, n_index2);
      end;
      if strip(index1) = strip(index2) then delete;
      keep index1 index2 n_index1 n_index2 weight1 weight2 diff_ward;
   run;

   proc sql noprint;
      select count(distinct(index1)) into :tmp_n_terr from &temp_data.;
   quit;

   /* get the % within variance for the current set of groupings */
   data temp;
      set HO_TERRITORY_1;
      do i = 1 to n_index1;
         index = scan(strip(index1), i, "_");
         output;
      end;
      keep index index1;
   run;

   proc sql noprint;
      create table temp1 as
      select distinct a.*, b.code, b.&y., b.&w.
      from temp as a, HO_TERRITORY_tmp as b
      where a.index = b.index
      order by index1;
   quit;

   data temp1;
      set temp1;
      retain territory 0;
      by index1;
      if first.index1 then territory + 1;
   run;

   proc glm data = temp1;
      class territory;
      model &y. = territory;
      weight &w.;
      ods output OverallANOVA = tmp_sum_squares(keep = Source DF SS);
   run;
   quit;

   proc sort data = tmp_sum_squares;
      by descending DF;
   run;

   data tmp_sum_squares;
      set tmp_sum_squares;
      retain SS_TOTAL 0;
      if _n_ = 1 then SS_TOTAL = SS;
      pct_variation = round(SS / SS_TOTAL * 100, 0.01);
      N_Territory = &tmp_n_terr.;
      if Source = "Error";
   run;

   data sum_squares;
      set sum_squares tmp_sum_squares;
   run;

%end;

%mend ho_territory;

REFERENCE
1. Philip J. Jennings, "Using Cluster Analysis to Define Geographical Rating Territories," Casualty Actuarial Society, 2008 Discussion Paper Program.
