Statistical Analysis of the EPIRARE survey data


Deliverable D1.4
Statistical Analysis of the EPIRARE survey data

Michele Santoro, Michele Lipucci, Fabrizio Bianchi and the EPIRARE Work Package 6 Team


CONTENTS

Overview of the documents produced by EPIRARE
Disclaimer
I. Part 1: Descriptive, multivariate and exploratory analyses
   A. Introduction
   B. Methods
   C. Results
   D. Conclusions
   E. References
II. Part 2: Cluster analysis
   A. Introduction
   B. Methods
   C. Results
      Target Population
      Number of diseases
      Geographical Coverage
      Data collected
      Data providers
      Expected Services by EU Platform
   D. Conclusions
   E. References

Overview of the documents produced by EPIRARE

Disclaimer

The contents of this document are the sole responsibility of the Authors; the Executive Agency for Health and Consumers is not responsible for any use that may be made of the information contained herein.

I. Part 1: Descriptive, multivariate and exploratory analyses

A. Introduction

The aim of Work Package 6 (WP6), "Common data set and disease-specific data collection", is to define a dataset structure that Registries of Rare Diseases can adopt and that allows them to collect information consistent with their stated targets [1,2]. A Registry develops its data elements in relation to its aim. The challenge is to identify a set of common data elements, defined consistently with clinical and epidemiological research, that can standardize data collection on rare diseases [3,4].

An analysis was performed to understand the needs and the informative capabilities of the existing Registries, in order to provide a common and shareable information platform. To this end, we analyzed data from the Survey developed for Rare Disease Registries operating in Europe and other countries. A total of 220 Registries answered the questionnaire. Our analysis focused only on the questions of interest to WP6.

B. Methods

A priori considerations directed the analysis towards selected Survey questions and their answers. The question on "Type of data collected" by the Registry was directly relevant to the aims of WP6, but the answers had to be evaluated along two essential dimensions: the purpose and the classification of the Registry. The information collected must be evaluated with respect to the different goals pursued and related to the main characteristics of the Registry, as obtained from the questionnaires. The variables analyzed were therefore the following:

- Aims;
- Population Target;
- Number of diseases;
- Data providers;
- Type of data collected;
- Disease Coding System;
- Data sharing.

The univariate distribution of the response options for the questions listed above was analyzed. Potential associations among variables were first investigated by bivariate analysis with the chi-square test and then studied in depth through multivariate analysis using Logistic Regression models [5,6]. Finally, a factor analysis was performed using Multiple Correspondence Analysis, mainly to uncover the structure of latent relationships among variables [7,8,9].

C. Results

One of the Survey questions concerned the objectives of the Registry, with more than 10 response options and no limit on the number of answers. "Epidemiological Research" was the main goal declared by the Registries (70.8%), followed by "Clinical Research" (61.2%) and "Natural History of Disease" (60.7%). More than half of the Registries deal with "Disease Surveillance" (55.7%), while almost half deal with genetic aspects ("Genotype-phenotype correlation" and "Mutation Database"). "Treatment evaluation" is a target for 42.9% of Registries, "Treatment Monitoring" for 33.3% and "Healthcare service planning" for 33.8%; just 1 Registry out of 5 deals with "Social planning" (19.2%).

One of the classifications that characterize a Registry is based on the target population. The Survey showed that 57.1% of responding Registries are Population-based, 24.0% are Hospital-based and 18.9% are Case-based. More than 80% of the Registries cover a single disease or a group of diseases, while only 7.3% cover all rare diseases.

With regard to data sources, "Clinical Units" provide data to 83.6% of Registries, "Clinical Genetics Units" to 43.8%, "Central Laboratories and services" to 43.4% and "Centres of Expertise" to 30.6%; almost half of the Registries collect data from "Patients and families" and 21.9% from "Patients' groups". An interesting aspect is the limited use of routine information systems: 31.1% of the Registries use data from "Discharge Registers" and only 12.8% use "Mortality Registers"; other routine information systems are used even less. Collaboration and data sharing take place with Other Registries for 51.0% of Registries, with Centres of expertise for 33.8% and with Biobanks for 16.2%.

With regard to the data collected, almost all Registries (95.0%) collect information on diagnosis; 86.6% collect "Clinical data" and 72.3% collect "Genetic data"; the latter is accompanied by information on "Family history" (55.0%) and on "Birth and reproductive history" (30.5%). 61.4% collect information on "Medication, devices and health services" and 46.2% collect "Socio-demographic information", while only 32.3% report recording the "Anagraphical data" (personal identification data) of patients. The very small percentage for the latter answer suggests a possible misinterpretation of the question and calls for a specific investigation.

Less than half of the Registries adopt the International Classification of Diseases (ICD9, ICD10 or ICDO) as disease coding system, and 36.5% do not use any code but just report the disease name; 13.0% of the Registries adopt the ORPHA code and the same percentage adopt the MIM code; 25.0% adopt their own coding system.

Table 1. Distribution of the answers to the following questions: Aim, Population Target, Number of diseases, Data Providers, Data collected, Data sharing and Disease coding system

Variable                                                    N       %

Aim
  Epidemiological Research                                  [...]   70.8
  Clinical Research                                         [...]   61.2
  Natural history of disease                                [...]   60.7
  Disease surveillance                                      [...]   55.7
  Genotype-phenotype correlation                            [...]   [...].4
  Mutation database                                         94      42.9
  Treatment evaluation (efficacy)                           94      42.9
  Healthcare service planning                               74      33.8
  Treatment monitoring (safety)                             73      33.3
  Social planning                                           42      19.2
  Other                                                     18      8.2

Population Target
  Population-based                                          [...]   57.1
  Hospital-based                                            52      24.0
  Case-based                                                41      18.9

Number of diseases
  Just one                                                  75      34.3
  A group of related RDs                                    [...]   [...].6
  Several RDs (not related)                                 26      11.9
  All rare diseases                                         16      7.3

Data Providers
  Clinical Units                                            [...]   83.6
  Patients and families                                     [...]   [...].4
  Clinical genetic Units                                    96      43.8
  Laboratories/central services                             95      43.4
  Discharge Registers                                       68      31.1
  Centres of Expertise                                      67      30.6
  Patients' groups                                          48      21.9
  Mortality Registers                                       28      12.8
  Birth Registers                                           8       3.7
  Disability Registers                                      7       3.2
  Other Registers                                           15      6.9

Data sharing
  Other Registers                                           [...]   51.0
  Biobanks                                                  32      16.2
  Centres of expertise                                      67      33.8

Data Collected
  Diagnosis                                                 [...]   95.0
  Clinical                                                  [...]   86.6
  Genetic                                                   [...]   72.3
  Medications, devices and health services                  [...]   61.4
  Family history                                            [...]   55.0
  Socio-demographic information                             [...]   46.2
  Patient-reported outcomes                                 78      35.5
  Anthropometric information                                72      32.7
  Anagraphical                                              71      32.3
  Birth and reproductive history                            67      30.5
  Clinical research participation and bio-specimen donation 67      30.5
  Patient's preferences for communication                   28      12.7

Disease Coding System
  ORPHA code                                                27      13.0
  MIM code                                                  27      13.0
  ICD O                                                     13      6.3
  ICD [...]                                                 [...]   [...].1
  ICD [...]                                                 [...]   [...].4
  Own code system                                           52      25.0
  No coding system, just disease name                       76      36.5

Taking into consideration the goals of WP6, a cross-analysis was performed to find possible dependencies and associations among some of the selected variables. A particularly relevant result of the bivariate analysis concerns the two main targets declared by the Registries: "Epidemiological Research" and "Clinical Research". These two objectives characterize the mission of the Registries almost exhaustively. In fact, only 6.9% of the Registries identify with neither of the two objectives; 38.8% claim to pursue both goals, while 54.4% are divided between the two lines of research: 32.0% deal only with "Epidemiological Research" and 22.4% only with "Clinical Research".

The two types of research activity show a diverging trend: the chi-square test on the two variables allows the hypothesis of independence to be rejected (p=0.003) and highlights an inverse association (PHI coefficient = -0.20). The inverse association also emerges from the Odds Ratio of "Clinical Research" with respect to "Epidemiological Research", which is 0.4 (95%CI= ). The analysis showed that the Registries tend to follow divergent development strategies according to the two lines of research; this divergence may translate into a different structure of the Registry in its further purposes and into different information collected.

In relation to the specific aim of WP6, we also investigated how the datasets generated by the different types of activity can be characterized. The associations of the answers "Epidemiological Research" and "Clinical Research" with the other declared objectives were calculated and compared. Logistic Regression models were used to evaluate these associations, using "Epidemiological Research" and "Clinical Research" alternately as outcome variables.
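The inverse association between the two main aims reported above can be checked with a short sketch. It is not the authors' code: the four cell counts are an assumption, back-calculated by rounding the reported percentages (38.8%, 32.0%, 22.4%, 6.9%) against the 219 Registries with non-missing Aims, and the function name is hypothetical. The chi-square is computed without continuity correction.

```python
import math

def two_by_two_association(both, only_a, only_b, neither):
    """Return (phi, odds_ratio, chi_square, p_value) for a 2x2 table."""
    n = both + only_a + only_b + neither
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    cross = both * neither - only_a * only_b
    phi = cross / math.sqrt(row_a * row_not_a * col_b * col_not_b)
    odds_ratio = (both * neither) / (only_a * only_b)
    # Pearson chi-square for a 2x2 table equals n * phi^2 (no continuity correction).
    chi_square = n * phi ** 2
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return phi, odds_ratio, chi_square, p_value

# Assumed counts: both aims, epidemiological only, clinical only, neither.
phi, odds_ratio, chi_square, p_value = two_by_two_association(85, 70, 49, 15)
print(f"phi={phi:.2f}  OR={odds_ratio:.1f}  chi2={chi_square:.1f}  p={p_value:.3f}")
```

Under these assumed counts the sketch gives values that round to the reported PHI coefficient, Odds Ratio and p-value, which suggests the reported figures are mutually consistent.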
The results are expressed as Odds Ratios (OR) and p-values (Table 2).

Table 2. Epidemiological Research and Clinical Research Odds Ratios compared to other objectives

Aim                                                   Epidemiological Research   Clinical Research
                                                      OR      p-value            OR      p-value
Disease surveillance                                  3.8     0.0004             0.5     0.099
Treatment evaluation (efficacy)                       1.3     0.636              1.2     0.685
Mutation database or Genotype-phenotype correlation   0.5     0.039              4.0     <0.0001
Social planning                                       2.9     0.075              0.6     0.302
Healthcare service planning                           0.9     0.843              1.8     0.193
Natural History of disease                            1.9     0.093              1.7     0.129
Treatment Monitoring                                  0.7     0.438              2.5     0.047

The Registries dealing with "Epidemiological Research" show a strong association with "Disease Surveillance" and with "Social planning"; they are significantly and inversely associated with aims concerning genetic aspects, which are combined here into the joint variable Mutation Database/Genotype-phenotype correlation. There is an evident association with "Natural history of disease" and a weak one with "Treatment evaluation". There is no association with "Healthcare service planning" and a weak inverse association with "Treatment Monitoring".

By contrast, Registries dealing with Clinical Research are inversely associated with Disease Surveillance and Social Planning, and therefore tend not to pursue these objectives. A positive but not statistically significant association emerges with Healthcare service planning, Natural history of disease and Treatment evaluation.

The same statistical model was applied to the "Target Population", "Data providers", "Data sharing", "Data Collected" and "Disease coding system" dimensions.

With respect to the "Target Population" (Table 3), the two types of activity again show opposite associations. Registries concerned with "Epidemiological Research" mainly have a "Population-based" structure, and "Hospital-based" Registries are also more frequent than "Case-based" ones. The Registries dealing with "Clinical Research" are mainly "Case-based", less often "Hospital-based" and even less often "Population-based".

Table 3. Epidemiological Research and Clinical Research Odds Ratios compared to the Target Population

Population target    Epidemiological Research   Clinical Research
                     OR      p-value            OR      p-value
Population-based     3.2     0.002              0.5     0.084
Hospital-based       1.6     0.253              0.8     0.584
Case-based           1                          1

With respect to the data sources used (Table 4), Registries that deal with "Epidemiological Research" obtain data from "Clinical Units" and "Centres of Expertise" more than Registries dealing with "Clinical Research" do; the latter instead draw mainly on information from patients (patients, families and groups). "Epidemiological Research" shows a statistically significant inverse association with "Clinical genetic units", which are instead significantly associated with "Clinical Research". All the Registries using "Mortality registers" deal with "Epidemiological Research", while neither area of research shows significant associations with the other types of routine information systems.

Table 4. Epidemiological Research and Clinical Research Odds Ratios compared to the Data Providers

Data provider                           Epidemiological Research   Clinical Research
                                        OR      p-value            OR      p-value
Clinical Units                          2.3     0.060              1.0     0.936
Clinical genetic Units                  0.3     0.001              2.1     0.041
Patients and families                   1.3     0.478              2.1     0.020
Patients' groups                        0.7     0.430              1.7     0.238
Laboratories/central services           1.5     0.290              0.7     0.309
Centres of expertise                    3.3     0.003              1.3     0.495
Discharge Registers                     1.1     0.774              1.2     0.616
Mortality Registers                     *                          0.6     0.255
Birth, Disability and other Registers   0.8     0.795              0.7     0.519

* All Registries using Mortality Registers deal with Epidemiological Research.

Sharing with Other Registers and Centres of expertise is more frequent among Registries that deal with Epidemiological Research, while sharing with Biobanks tends to be more frequent among Registries dealing with Clinical Research (Table 5).

Table 5. Epidemiological Research and Clinical Research Odds Ratios compared to the Data sharing

Data sharing           Epidemiological Research   Clinical Research
                       OR      p-value            OR      p-value
Other Registers        1.5     0.185              1.3     0.351
Biobanks               1.1     0.803              1.7     0.248
Centres of expertise   1.9     0.101              1.3     0.473

With regard to the type of data collected (Table 6), the analysis revealed considerably different information profiles depending on the type of research pursued by the Registry. The response option "Diagnosis" was excluded from the analysis because diagnosis data are collected by almost all the Registries, which makes this variable uninformative. "Epidemiological Research" shows a positive and statistically significant association with "Socio-demographic information"; it is also positively associated with "Anagraphical data" and with "Clinical data", whereas it is strongly and inversely associated with "Genetic data".

By contrast, "Clinical Research" is significantly associated with "Genetic data", as well as with "Medications, devices and health services" and "Clinical Research participation and bio-specimen donation"; it is also associated, but not significantly, with "Clinical data", and it shows a statistically significant inverse association with "Socio-demographic information".

Table 6. Epidemiological Research and Clinical Research Odds Ratios compared to the Data collected

Data                                                        Epidemiological Research   Clinical Research
                                                            OR      p-value            OR      p-value
Anagraphical                                                1.8     0.130              1.2     0.618
Socio-demographic information                               4.2     <0.0001            0.5     0.050
Genetic                                                     0.2     0.003              4.4     0.001
Clinical                                                    2.8     0.085              1.9     0.234
Medications, devices and health services                    0.9     0.792              2.6     0.015
Patient-reported outcomes                                   0.9     0.861              1.3     0.524
Family history                                              1.3     0.459              1.3     0.468
Anthropometric information                                  1.6     0.236              0.6     0.264
Birth and reproductive history                              1.0     0.931              0.7     0.459
Clinical research participation and bio-specimen donation   0.6     0.257              3.0     0.013

Finally, the Registries that deal with Epidemiological Research tend to use the ICD coding system and not the MIM code, while the opposite holds for Registries that deal with Clinical Research. Furthermore, Epidemiological Research is positively but not significantly associated with the use of the ORPHA code and with the use of an own coding system (Table 7).

Table 7. Epidemiological Research and Clinical Research Odds Ratios compared to the Disease Coding System

Coding system      Epidemiological Research   Clinical Research
                   OR      p-value            OR      p-value
ORPHA code         1.7     0.413              0.6     0.295
MIM code           0.2     0.003              5.3     0.003
ICD                4.2     0.048              0.3     0.054
Own code system    3.4     0.060              0.8     0.720
No coding system   2.2     0.292              0.9     0.787

The factor analysis substantially confirmed the picture that emerged from the logistic regression models. A Multiple Correspondence Analysis was performed in order to build a factorial plane able to highlight latent structures of relationship in the data; the following variables were selected as active variables to define the factorial axes:

- Aims;
- Population target;
- Number of diseases.

The other variables considered in the statistical model were then projected onto this factorial plane as supplementary variables:

- Data providers;
- Data sharing;
- Data collected;
- Disease coding system.

Figure 1 reports the plane defined by the first two factorial axes. The inertia explained by the first axis, according to the Benzécri correction [10], is 56.44%, while the second axis explains 41.79%, so that the plane explains 98.23% of the total variability.

The variables related to the objectives of the Registries that contribute most to the definition of the first factorial axis are "Treatment evaluation", "Treatment Monitoring", "Social planning", "Healthcare service planning", "Disease surveillance" and "Natural History of Disease". This axis could therefore be interpreted as a measure of monitoring and evaluation activity. The upper part of the second axis carries the contribution of "Epidemiological Research", while the lower part carries the contributions of "Clinical Research", "Genetic Research" and "Natural History of Disease". The orientation of the axis could be interpreted as follows: research on disease downward and research on population upward.

In fact, the "Population-based" mode is located at the top along the second axis, but it is not associated with the monitoring dynamic explained by the first axis; the "Case-based" mode is located in the diametrically opposite part; the "Hospital-based" mode is located downward but is also associated with the first factorial axis. The mode "All/Several diseases" is located upward along the second axis, while "One disease" and "A group of diseases" are located along the second axis, with a greater contribution of the latter.

In the upper part of the plane, "Epidemiological research", "Disease surveillance", "Healthcare service planning" and "Social planning" lie in the same direction and are therefore correlated. In the lower part of the plane there is a correlation between "Clinical Research" and "Genetic Research", as well as between "Treatment evaluation" and "Treatment Monitoring".
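As background to the inertia percentages quoted for the two axes, the Benzécri correction can be sketched as follows. For Q active variables, each eigenvalue of the indicator-matrix analysis that exceeds 1/Q is rescaled to ((Q/(Q-1)) * (eigenvalue - 1/Q))^2, and percentages are taken over the sum of the rescaled values. The eigenvalues and the value of Q below are hypothetical, chosen only for illustration; they are not the Survey's values, and the function name is an assumption.

```python
def benzecri_percentages(eigenvalues, n_active_variables):
    """Benzecri-corrected inertia percentages for a Multiple Correspondence Analysis."""
    q = n_active_variables
    threshold = 1.0 / q
    # Only axes whose eigenvalue exceeds 1/Q are kept and rescaled.
    corrected = [((q / (q - 1)) * (ev - threshold)) ** 2
                 for ev in eigenvalues if ev > threshold]
    total = sum(corrected)
    return [100.0 * c / total for c in corrected]

# Hypothetical eigenvalues for Q = 3 active variables (not taken from the report).
shares = benzecri_percentages([0.50, 0.40, 0.20], n_active_variables=3)
print([round(s, 1) for s in shares])
```

Because axes at or below the 1/Q threshold are discarded, the corrected percentages of the first axes are much higher than the raw ones, which is why two axes can account for almost all of the corrected inertia.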

Figure 1. Factorial plane determined by the active variables

Figure 2 shows the plane onto which the information on Data providers and Data sharing is projected. Data from routine information systems (mortality, discharge and other registries) are located in the first quadrant of the factorial plane, the one oriented towards monitoring in the public health field. The variable "Laboratories/central services" is also located in the first quadrant, but closer to the origin of the axes, which represents the centre of gravity. "Clinical Units" and, more clearly, "Centres of Expertise" tend to be strongly associated with the first factor but do not discriminate with respect to the second; the variables "Patients and families", "Patients' groups" and "Genetic Units" lie in the direction of clinical and genetic research. Sharing with Other Registers is located in the upper part of the plane, while sharing with Biobanks is located in the lower part.

Figure 3 shows the projections of Data collected and Disease coding system on the factorial plane. The first quadrant contains the "Anagraphical data" and "Socio-demographic information" modes, while the lower quadrant contains "Genetic data", "Family history", "Clinical research participation and biospecimen donation", "Anthropometric information", "Medications, devices and health services" and "Clinical data"; "Diagnosis" is confirmed to be a non-discriminating variable. "Birth and reproductive history", "Anthropometric information" and "Clinical research participation and biospecimen donation" are data that tend to be collected mostly by the Registries that deal with monitoring and evaluation in a clinical setting.

The use of the ICD code, located in the upper quadrant, is associated with Epidemiological Research and with the "Population-based" mode; the ORPHA code lies in the same quadrant, while the opposite quadrant, associated with clinical and genetic research, contains the use of the MIM code and the "No coding system" variable.

Figure 2. Projection of the Data providers and of the Data sharing on the factorial plane

Figure 3. Projection of the Data collected and of the Disease coding system on the factorial plane

D. Conclusions

The analysis of the Survey data, focused on the specific objectives of WP6, provided useful information on the features of the existing Registries that can support the definition of the common dataset. In particular, the Registries show a tendency to separate their lines of research, and this divergence affects all the information they collect. The divergence between Registries pursuing epidemiological research and those pursuing clinical research clearly points to two distinct types of Registry.

On the one hand, there are Registries, potentially population-based, whose main target is "Epidemiological Research"; these Registries deal with disease surveillance and social impact, interface with other information systems, use the ICD system and collect personal data.

On the other hand, there are Registries, potentially case-based and/or hospital-based, whose main goal is "Clinical Research"; these Registries pursue goals concerning genetic aspects and the assessment and monitoring of treatments, for which the bulk of the information consists of genetic and clinical data and, obviously, data concerning the diagnosis.

In short, there are Registries that we could define as population-oriented, which potentially pursue public health objectives, and disease-oriented Registries with clinical-genetic research objectives.

It is worth noting that these Registries also showed several common traits, which should be explored further with a view to identifying a common data set. The differences described above may indicate possible limitations in orienting the overall research activities that should characterize a Registry of Rare Diseases that aims to be a useful tool for public health. This kind of research must necessarily use both epidemiological and clinical data, and at the same time should form the basis for the development of specialized registries. In this perspective, the identification of a common dataset is strategically relevant for collecting consistent data and for developing a solid and flexible platform for research and public health activities.

E. References

1. Nadkarni PM, Brandt CA (2006) The common data elements for cancer research: remarks on functions and structure. Methods Inf. Med. 45(6):
2. Carter J, Evans J, Tuttle M, Weida T, White T, Harvell J, Shipley S (2006) Making the minimum data set compliant with health information technology standards. Executive summary. U.S. Department of Health and Human Services. Accessed: 2nd September
3. Richesson RL, Krischer JP (2007) Data standard in clinical research: gaps, overlaps, challenges and future directions. J. Am. Med. Inform. Assoc. 14(6):
4. AHRQ (2010) Registries for evaluating patient outcomes: a user's guide. In: Gliklich RE, Dreyer N (eds) Agency for Healthcare Research and Quality, Rockville, MD.
5. McCullagh P, Nelder JA (1989) Generalized Linear Models, Second Edition, London: Chapman and Hall.
6. Woodward M (2005) Epidemiology: Study Design and Data Analysis, Second Edition, New York: Chapman & Hall/CRC.
7. Benzécri JP (1973) L'Analyse des Données: T. 2, l'Analyse des Correspondances, Paris: Dunod.
8. Greenacre MJ (1984) Theory and Applications of Correspondence Analysis, London: Academic Press.
9. Greenacre MJ (1994) Multiple and Joint Correspondence Analysis, in: MJ Greenacre and J Blasius (eds), Correspondence Analysis in the Social Sciences, London: Academic Press.
10. Benzécri JP (1979) Sur le calcul des taux d'inertie dans l'analyse d'un questionnaire, Addendum et erratum à [BIN.MULT.], Cahiers de l'Analyse des Données 4,

II. Part 2: Cluster analysis

A. Introduction

The objective of WP6 is to develop a proposal for a common dataset applicable to the Registries on Rare Diseases, following a bottom-up approach. The data collected through a questionnaire sent to 220 Registries operating in different European countries represent a large body of information that is crucial for understanding the characteristics and weaknesses of the Registries active in the field of rare diseases. The analysis of these data is used to generate information that supports the definition of a common dataset.

An initial analysis of the Survey, focused on the specific objective of WP6, has already provided important information on the actual differentiation of the Registries on Rare Diseases, which reflects their different objectives. The previous analysis (see WP6 Interim Technical Report) identified different patterns. In particular, a multivariate analysis was conducted to identify relations between the variables that define the objectives and other characteristic elements of the Registries. This multivariate analysis, carried out with the technique of Multiple Correspondences, identified relations among groups of variables that pointed to two macro-groups of Registries ("there are Registries that we could define as population-oriented, that potentially pursue public health objectives, and Registries disease-oriented with clinical-genetic research objectives").

We performed further analyses on the Survey database in order to better understand the profile of the Registries and to define more accurately their common information needs (common dataset). A cluster analysis was carried out to identify groups of Registries with common traits and characteristics.
It should be underlined that this second analytical approach shifts the point of observation from the variables (Correspondence Analysis) to the units of observation, the Registries (Cluster Analysis).

B. Methods

Cluster Analysis is a set of statistical techniques that identify, through iterative processes, groups of observations that are similar with respect to specific characteristics. We performed a Cluster Analysis using a hierarchical model, a computational process that progressively merges the observations that are closest and most similar to each other, starting from the individual observations and continuing until a single group is obtained. Priority was given to the Aims declared by the Registries, as the objectives of a registry represent the main feature associated with its information needs.

We carried out additional analyses that considered as explanatory variables, in addition to the Aims, other variables collected by the questionnaire: Number of diseases, Population target and Geographical Coverage. These additional analyses either did not provide meaningful grouping patterns or produced such a large number of clusters that no clear interpretation was possible. This finding is important: it indicates that, as far as identifying a small number of clusters is concerned, using one variable or several gives much the same result, which is an indirect indication that the result obtained is consistent.

The Cluster Analysis performed in our study was carried out on the following Aims declared by the Registries:

- Epidemiological Research
- Clinical Research
- Natural history of disease
- Disease surveillance
- Genotype-phenotype correlation + Mutation database
- Healthcare service planning
- Social planning
- Treatment evaluation (efficacy)
- Treatment monitoring (safety)

To interpret the Clusters identified by the statistical analysis, we analyzed the distribution of Aims in each Cluster and evaluated the deviation of the observed frequency from an expected value. The expected value for each Aim was estimated by applying to each cluster the overall percentage calculated on all Registries (for example, the percentage of the Aim Clinical Research in all Registries is 61.2%; this value was used to calculate the Expected value of Clinical Research for each Cluster). The deviation of the observed value from the expected value is expressed as a percentage Deviation (%Deviation = 100 * (Number Observed - Number Expected) / Number Expected).
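The expected value and the percentage Deviation defined above can be illustrated with a minimal sketch. The function names are hypothetical; the figures are taken from the report (Cluster 1 contains 52 Registries, of which 47 declare Epidemiological Research and 11 Clinical Research; the overall shares are 70.8% and 61.2%). The sketch assumes that the expected count is not rounded before the deviation is computed, which is what reproduces the reported percentages.

```python
def expected_count(cluster_size, overall_percentage):
    """Expected number of Registries with an Aim if the cluster matched the overall share."""
    return cluster_size * overall_percentage / 100.0

def percentage_deviation(observed, expected):
    """%Deviation = 100 * (Observed - Expected) / Expected."""
    return 100.0 * (observed - expected) / expected

# Cluster 1 (52 Registries): Epidemiological Research and Clinical Research.
exp_epi = expected_count(52, 70.8)
dev_epi = percentage_deviation(47, exp_epi)
exp_clin = expected_count(52, 61.2)
dev_clin = percentage_deviation(11, exp_clin)

print(round(exp_epi), round(dev_epi))    # expected count and % deviation
print(round(exp_clin), round(dev_clin))
```

A positive deviation means the Aim is over-represented in the cluster relative to the whole sample, a negative one that it is under-represented.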
To validate the consistency of the interpretation of the clusters and to produce additional information on the characterization of the different types of Registries, we also estimated the distribution and the percentage deviation for the following questions: Target Population, Number of diseases, Geographical Coverage, Collected Data, Data Providers and Services expected by EU platform.
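The hierarchical procedure described in the Methods can be sketched on toy data as follows. This is an illustration only: the report does not state which distance or linkage rule was used, so the Hamming distance and average linkage below are assumptions, and the six Registries and their answers are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Share of aims on which two Registries give different answers."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data, n_clusters):
    """Start from single Registries and merge the two closest clusters repeatedly."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], data))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]

# Hypothetical answers (1 = aim declared); columns: epidemiological research,
# disease surveillance, clinical research, genetic aims, treatment evaluation.
registries = [
    (1, 1, 0, 0, 0),
    (1, 1, 0, 0, 0),
    (0, 0, 1, 1, 0),
    (0, 0, 1, 1, 0),
    (1, 1, 1, 1, 1),
    (1, 0, 1, 1, 1),
]
groups = agglomerate(registries, n_clusters=3)
print(groups)
```

Stopping the merging at a chosen number of clusters corresponds to cutting the dendrogram at a given height; letting the loop run to a single cluster would reproduce the full hierarchy.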

C. Results

The iterative clustering process can be followed in the Dendrogram (Figure 1), which highlights three major clusters. The statistical criteria used to determine the number of clusters (cubic clustering criterion and Pseudo F test) also confirm the identification of 3 clusters (Figure 2). The number of Registries in each cluster (Table 1) is fairly balanced: Cluster 1 includes 52 Registries (23.7%), Cluster 2 includes 86 (39.3%) and Cluster 3 includes 81 (37.0%).

Table 1. Composition of the Clusters

Cluster      Number   %
Cluster 1    52       23.7
Cluster 2    86       39.3
Cluster 3    81       37.0
Total        219*

* 1 Registry not analysed: the Aims were reported as missing

Figure 1. Dendrogram of Cluster Analysis
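The Pseudo F criterion mentioned above compares the between-cluster and within-cluster sums of squares for a given number of clusters k: F = (between SS / (k - 1)) / (within SS / (n - k)), and larger values indicate a better-separated partition. The sketch below is a hedged illustration on hypothetical data, assuming squared Euclidean distances on the 0/1 answers; it is not the computation used in the report, and the function name is an assumption.

```python
def pseudo_f(points, labels):
    """Pseudo F statistic: (between SS / (k - 1)) / (within SS / (n - k))."""
    n, dims = len(points), len(points[0])

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sum_sq(rows, center):
        return sum((r[d] - center[d]) ** 2 for r in rows for d in range(dims))

    total_ss = sum_sq(points, centroid(points))
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    within_ss = sum(sum_sq(rows, centroid(rows)) for rows in groups.values())
    k = len(groups)
    return ((total_ss - within_ss) / (k - 1)) / (within_ss / (n - k))

# Hypothetical Registries described by five 0/1 aims.
points = [
    (1, 1, 0, 0, 0), (1, 1, 0, 0, 0),
    (0, 0, 1, 1, 0), (0, 0, 1, 1, 0),
    (1, 1, 1, 1, 1), (1, 0, 1, 1, 1),
]
f_two = pseudo_f(points, [0, 0, 1, 1, 1, 1])    # two-cluster partition
f_three = pseudo_f(points, [0, 0, 1, 1, 2, 2])  # three-cluster partition
print(round(f_two, 2), round(f_three, 2))
```

In this toy example the three-cluster partition has the higher value, which is the kind of evidence used to choose the number of clusters.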

Figure 2. Criteria for the definition of the number of Clusters: Cubic Clustering and Pseudo F

We analyzed the distribution of the Aims in each group to facilitate the interpretation of the identified Clusters. Table 2 shows, for each of the three clusters, the distribution of the Aims and the percentage deviation (see Methods) from the expected value. In Cluster 1, 90.4% of Registries perform Epidemiological Research, 75.0% pursue Disease Surveillance and 63.5% pursue Healthcare Service Planning. In Cluster 1 the expected number of Registries with the Aim of Epidemiological Research is 37, whereas the observed number is 47, corresponding to +28%. Cluster 1 also shows a positive deviation for Disease Surveillance (+35%), Healthcare Service planning (+88%) and Social Planning (+70%); it shows negative deviations for Clinical research (-65%), Natural History of disease (-81%) and Mutation database or Genotype-phenotype correlation (-93%).

For the same Aims, Cluster 2 shows the opposite trend, with positive deviations for Clinical research (+14%), Natural History of disease (+9%) and Mutation database or Genotype-phenotype correlation (+38%), and negative deviations for Epidemiological research (-21%), Disease Surveillance (-54%), Healthcare Service planning (-79%) and Social Planning (-88%).

Both Clusters show a lower percentage than expected for Treatment evaluation (Cluster 1: -60%; Cluster 2: -86%) and for Treatment Monitoring (Cluster 1: -83%; Cluster 2: -86%).

Cluster 3 showed a higher percentage than expected for all the Aims, but especially for the Aims relating to Treatment: 98.8% (+130% compared to the expected value) of the Registries in Cluster 3 declare Treatment evaluation and 81.5% (+144%) declare Treatment Monitoring.

Table 2. Number and percentage of Registries observed, Number of Registries Expected, Percentage Deviation from Expected, by Cluster and Aim

Aim                                   Cluster 1 (n=52)       Cluster 2 (n=86)       Cluster 3 (n=81)
                                      N (%)      Exp  Dev    N (%)      Exp  Dev    N (%)      Exp  Dev
Clinical research                     11 (21.2)  32   -65%   60 (69.8)  ...  +14%   63 (77.8)  ...  ...
Disease surveillance                  39 (75.0)  ...  +35%   22 (25.6)  48   -54%   61 (75.3)  ...  ...
Epidemiological research              47 (90.4)  37   +28%   48 (55.8)  61   -21%   60 (74.1)  57   +5%
Genotype-phenotype/mutation database  2 (3.8)    30   -93%   69 (80.2)  ...  +38%   56 (69.1)  ...  ...
Healthcare services planning          33 (63.5)  ...  +88%   6 (7.0)    29   -79%   35 (43.2)  ...  ...
Natural history of disease            6 (11.5)   32   -81%   57 (66.3)  52   +9%    70 (86.4)  ...  ...
Social planning                       17 (32.7)  ...  +70%   2 (2.3)    16   -88%   23 (28.4)  ...  ...
Treatment evaluation                  9 (17.3)   22   -60%   5 (5.8)    37   -86%   80 (98.8)  ...  +130%
Treatment monitoring                  3 (5.8)    17   -83%   4 (4.7)    29   -86%   66 (81.5)  ...  +144%

N = Number of Registries observed in the Cluster
Exp = Number of Registries expected in the Cluster
Dev = Percentage Deviation of Observed value from Expected value (see Methods)

The interpretation of these results seems quite clear for the first two Clusters: Cluster 1 is characterized by Registries pursuing Aims related to Public Health activities, while Cluster 2 identifies Registries oriented towards Clinical and Genetic Research. The interpretation of Cluster 3 is more complex: it seems to include Registries based on research for the assessment and monitoring of Treatment. In Cluster 3 we also observed a higher percentage than expected for all the other Aims, which could be explained as a bias due to Registries declaring multiple objectives.

Overall, the results of the Cluster Analysis agree with the findings of the previous Multiple Correspondence Analysis, which was aimed at identifying correlations among the variables. The factorial plane obtained by the Multiple Correspondence Analysis (Figure 3) clearly identified the association among the 3 groups of Aims, as also confirmed by the Cluster Analysis.

Figure 3. Factorial Plane from the Multiple Correspondence Analysis

The Cluster Analysis indicated the presence of three macro-types of Registries, whose Aims are clearly differentiated mainly for the first two types. The joint interpretation of the results of the Cluster Analysis and the Multiple Correspondence Analysis suggests naming the three clusters of Registries as follows:

Cluster 1: Public Health Registries
Cluster 2: Clinical and Genetic Research Registries
Cluster 3: Treatment Registries

On the basis of this interpretation we analyzed the distribution of the other variables collected by the Survey within the three types of Registries identified. The results are expressed as the percentage distribution within each group and as the percentage deviation from the expected value, calculated in the same way as for the question on Aims. The results are reported for the questions: Target Population, Number of diseases, Geographical Coverage, Data collected, Data providers and Services expected by EU Platform.

1. Target Population

Of the "Public Health" Registries, 78.8% are Population-based, whereas only a small number are Case-based (5.8%) or Hospital-based (15.4%). The Registries belonging to the other two types are also predominantly Population-based, but with values lower than expected. "Clinical-Genetic Research" Registries showed a greater tendency to be Case-based (+30% compared to the expected value), whereas "Treatment" Registries showed a greater tendency to be Hospital-based (+26%). This result is consistent with the interpretation given to the Clusters.

Table 3. Number and percentage of Registries observed, Number of Registries Expected, Percentage Deviation from Expected, by Cluster and Population target

Population target  Public Health              Clinical-Genetic Research  Treatment
                   N (%)      Exp  Dev        N (%)      Exp  Dev        N (%)      Exp  Dev
Case based         3 (5.8)    10   -70%       21 (24.7)  ...  +30%       17 (21.5)  ...  ...
Hospital based     8 (15.4)   13   -36%       20 (23.5)  20   -2%        24 (30.4)  ...  +26%
Population based   41 (78.8)  ...  ...        44 (51.8)  48   -9%        38 (48.1)  45   -16%

2. Number of diseases

"Public Health" Registries, in contrast to the other two types, show a greater tendency to cover all diseases (+73% compared to the expected value). Among "Clinical-Genetic Research" Registries, 47.7% collect data on a group of diseases, 34.9% on a single disease and 17.4% on all diseases; this distribution is close to the expected one. "Treatment" Registries deal mainly with a group of rare diseases (49.4%) or with one rare disease (38.3%, a value 13% above the expected).

Table 4. Number and percentage of Registries observed, Number of Registries Expected, Percentage Deviation from Expected, by Cluster and Number of diseases

Number of diseases  Public Health              Clinical-Genetic Research  Treatment
                    N (%)      Exp  Dev        N (%)      Exp  Dev        N (%)      Exp  Dev
A group / several   21 (41.2)  24   -12%       41 (47.7)  40   +2%        40 (49.4)  38   +6%
All                 17 (33.3)  ...  +73%       15 (17.4)  17   -9%        10 (12.3)  16   -36%
Just one            13 (25.5)  17   -25%       30 (34.9)  29   +3%        31 (38.3)  ...  +13%

3. Geographical Coverage

In all three types the largest share of Registries has national coverage, but the distribution differs: although 50% of "Public Health" Registries have national coverage, this value is 19% lower than expected; "Clinical-Genetic Research" Registries show a frequency of national coverage equal to the expected; "Treatment" Registries show a higher value than expected (+12%). Regional coverage is reported by 22 "Public Health" Registries against 9 expected (+148%), whereas the deviation is reversed for the other two types ("Clinical-Genetic Research" -58%, "Treatment" -35%). International coverage is reported by only a small number of "Public Health" Registries (-79% compared to the expected), whereas "Clinical-Genetic Research" Registries show a positive deviation (+52%).

Table 5. Number and percentage of Registries observed, Number of Registries Expected, Percentage Deviation from Expected, by Cluster and Geographical coverage

Geographical coverage  Public Health              Clinical-Genetic Research  Treatment
                       N (%)      Exp  Dev        N (%)      Exp  Dev        N (%)      Exp  Dev
International          2 (3.8)    9    -79%       23 (27.4)  ...  +52%       14 (17.3)  15   -4%
Local                  2 (3.8)    2    +19%       3 (3.6)    3    +11%       2 (2.5)    3    -23%
National               26 (50.0)  32   -19%       52 (61.9)  52   0%         56 (69.1)  ...  +12%
Regional               22 (42.3)  9    +148%      6 (7.1)    14   -58%       9 (11.1)   14   -35%

4. Data collected

Diagnosis is collected by almost all of the Registries (95%), so there are no substantial differences among the three types. "Public Health" Registries tend to collect more anagraphic (personal) data (+26% compared to the expected) and socio-demographic data (+28%), whereas the other types of information are collected less often than expected. Among "Clinical-Genetic Research" Registries, 88.4% collect clinical data and 86.0% genetic data; these Registries also show a higher frequency than expected for the collection of data on Family history (+17%) and Patient's preferences for communication (+27%). Among "Treatment" Registries, 97.5% collect Clinical data, and the percentage deviation is positive for all types of information except anagraphic data; the highest values are observed for Anthropometric info (+68%), Patient-reported outcomes (+65%), Birth and reproductive history (+49%), Clinic research participation and biospecimen donation (+39%), Medications, devices and health services (+37%) and Family history (+22%).

Table 6. Number and percentage of Registries observed, Number of Registries Expected, Percentage Deviation from Expected, by Cluster and Data collected

Data collected                                          Public Health              Clinical-Genetic Research  Treatment
                                                        N (%)      Exp  Dev        N (%)      Exp  Dev        N (%)      Exp  Dev
Anagraphic                                              21 (40.4)  ...  +26%       26 (30.2)  27   -5%        23 (28.4)  26   -11%
Anthropometric info                                     8 (15.4)   17   -53%       19 (22.1)  28   -32%       44 (54.3)  ...  +68%
Clinic research participation and biospecimen donation  6 (11.5)   16   -62%       26 (30.2)  26   0%         34 (42.0)  ...  +39%
Birth and reproductive history                          9 (17.3)   16   -43%       21 (24.4)  26   -20%       37 (45.7)  ...  +49%
Clinical data                                           35 (67.3)  45   -22%       76 (88.4)  75   +2%        79 (97.5)  ...  ...
Diagnosis                                               52 (100)   49   +5%        77 (89.5)  82   -6%        79 (97.5)  77   +3%
Family history                                          11 (21.2)  28   -61%       55 (64.0)  ...  +17%       54 (66.7)  ...  +22%
Genetic data                                            21 (40.4)  38   -44%       74 (86.0)  ...  ...        63 (77.8)  58   +8%
Medications, devices and health services                22 (42.3)  32   -31%       44 (51.2)  53   -16%       68 (84.0)  ...  +37%
Patient's preferences for communication                 3 (5.8)    7    -55%       14 (16.3)  ...  +27%       11 (13.6)  10   +6%
Patient-reported outcomes                               9 (17.3)   18   -51%       21 (24.4)  30   -31%       47 (58.0)  ...  +65%
Socio demographic info                                  32 (61.5)  ...  +28%       26 (30.2)  41   -37%       47 (58.0)  ...  ...

5. Data providers

Clinical Units are the main data providers for all three types of Registries. "Public Health" Registries tend to use Health Information Systems (Hospital databases, Mortality registers and other Registries), whereas they make limited use of information from Clinical Genetic Units (-34%), from patients and their families (-29%) and from patients' organisations (-30%). In contrast, "Clinical-Genetic Research" Registries tend to use sources close to the patients (Patients and families +15%, Patients' groups +16%) and Clinical Genetic Units (+14%), while Health Information Systems are used less than expected. The data providers used more than expected by "Treatment" Registries are Centres of Expertise (+18%), Hospital databases (+12%), Clinical units (+11%) and Clinical genetic units (+8%).

Table 7. Number and percentage of Registries observed, Number of Registries Expected, Percentage Deviation from Expected, by Cluster and Data provider

Data provider                  Public Health              Clinical-Genetic Research  Treatment
                               N (%)      Exp  Dev        N (%)      Exp  Dev        N (%)      Exp  Dev
Centres of expertise           13 (25.0)  16   -19%       25 (29.1)  26   -5%        29 (36.3)  ...  +18%
Clinical genetic units         15 (28.8)  23   -34%       43 (50.0)  ...  +14%       38 (47.5)  35   +8%
Clinical units                 43 (82.7)  43   -1%        65 (75.6)  72   -9%        74 (92.5)  ...  +11%
Hospital databases             23 (44.2)  ...  ...        17 (19.8)  27   -37%       28 (35.0)  ...  +12%
Laboratories/central services  26 (50.0)  ...  ...        32 (37.2)  37   -15%       37 (46.3)  35   +6%
Mortality registers            17 (32.7)  ...  ...        1 (1.2)    11   -91%       10 (12.5)  10   -3%
Other registers                14 (26.9)  ...  ...        3 (3.5)    10   -70%       8 (10.0)   9    -13%
Patients and families          18 (34.6)  25   -29%       48 (55.8)  ...  +15%       40 (50.0)  39   +3%
Patients' groups               8 (15.4)   11   -30%       22 (25.6)  ...  +16%       18 (22.5)  18   +2%

6. Expected Services by EU Platform

The service most often expected from the Platform by "Public Health" Registries is the Quality control system (66.0%), whereas for "Clinical-Genetic Research" and "Treatment" Registries it is IT tools. Consistently with their epidemiological function, "Public Health" Registries have a higher expectation than expected for Facilitated access to data sources (+11%).

Table 8. Number and percentage of Registries observed, Number of Registries Expected, Percentage Deviation from Expected, by Cluster and Expected service by EU Platform

Expected service by EU Platform                     Public Health              Clinical-Genetic Research  Treatment
                                                    N (%)      Exp  Dev        N (%)      Exp  Dev        N (%)      Exp  Dev
Expert technical advice                             15 (31.9)  18   -18%       28 (35.4)  31   -9%        33 (47.8)  ...  +23%
Facilitated access to data sources                  23 (48.9)  ...  +11%       22 (27.8)  35   -37%       41 (59.4)  ...  +35%
IT tools                                            27 (57.4)  32   -16%       56 (70.9)  54   +3%        51 (73.9)  47   +8%
Legal advice                                        21 (44.7)  23   -7%        37 (46.8)  38   -3%        36 (52.2)  33   +8%
Model documents                                     21 (44.7)  22   -5%        47 (59.5)  ...  +26%       24 (34.8)  33   -26%
Quality control systems                             31 (66.0)  ...  ...        37 (46.8)  45   -17%       42 (60.9)  39   +8%
Tools for networking among partners and Registries  24 (51.1)  27   -10%       45 (57.0)  45   0%         42 (60.9)  39   +7%
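The percentages in Table 8 appear to be computed on fewer Registries than the cluster sizes of Table 1, presumably because not every Registry answered this question. The sketch below is an inference from the published counts and percentages, not a method stated in the report: it searches for the denominator implied by each count and percentage in the "Model documents" row, then recomputes the percentage deviation using those denominators. The function names and the proportional-allocation formula for the expected count are assumptions.

```python
from typing import Dict, List


def implied_respondents(count: int, percent: float, max_n: int) -> List[int]:
    """List every denominator n (count <= n <= max_n) for which
    100 * count / n agrees with `percent` to one decimal place."""
    return [
        n
        for n in range(max(count, 1), max_n + 1)
        if abs(100.0 * count / n - percent) < 0.05
    ]


def percentage_deviation(
    observed: int, row_total: int, respondents: int, all_respondents: int
) -> float:
    """Deviation of the observed count from the proportionally expected count."""
    expected = row_total * respondents / all_respondents
    return 100.0 * (observed - expected) / expected


# "Model documents" row of Table 8: (count, reported %, cluster size from Table 1).
model_documents = {
    "Public Health": (21, 44.7, 52),
    "Clinical-Genetic Research": (47, 59.5, 86),
    "Treatment": (24, 34.8, 81),
}

respondents: Dict[str, int] = {}
for name, (count, percent, size) in model_documents.items():
    candidates = implied_respondents(count, percent, size)
    if len(candidates) != 1:
        raise ValueError(f"denominator for {name} is ambiguous: {candidates}")
    respondents[name] = candidates[0]

row_total = sum(count for count, _, _ in model_documents.values())
all_respondents = sum(respondents.values())
deviations = {
    name: percentage_deviation(count, row_total, respondents[name], all_respondents)
    for name, (count, _, _) in model_documents.items()
}

for name in model_documents:
    print(f"{name}: n = {respondents[name]}, deviation {deviations[name]:+.0f}%")
```

With these inputs the implied denominators are 47, 79 and 69 Registries, and the recomputed deviations (-5%, +26%, -26%) agree with the values reported for Model documents.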

"Clinical-Genetic Research" Registries declare a higher expectation for services related to Model documents (+26%). "Treatment" Registries express a higher expectation for Facilitated access to data sources (+35%) and Expert technical advice (+23%).

D. Conclusions

The Cluster Analysis identified three main types of Registries, whose Aims showed a clear pattern of differentiation, in particular for Clusters 1 and 2. From the distribution of the Aims and the percentage deviations from the expected values it was possible to define the three types of Registries, named "Public Health", "Clinical-Genetic Research" and "Treatment". Relative to the number of Registries in the Survey, "Public Health" Registries are less represented (23.7%) than "Clinical-Genetic Research" (39.3%) and "Treatment" Registries (37.0%). The analysis of the Survey questions showed a picture that is largely consistent with the definition of the Clusters and provided a large amount of useful information for characterizing the different types of Registries.

"Public Health" Registries pursue the aims of Epidemiological Research, Social Planning, Healthcare Services Planning and Disease Surveillance. They show a greater tendency than the other types of Registries to be population-based, to collect information on all diseases and to have a regional coverage, closer to the target of health policy. Consistently with their epidemiological function, they tend to collect anagraphic and socio-demographic data and to make greater use of Health Information Systems as a data source. Their expectations of the Platform are directed towards the Quality control system and Facilitated access to data sources.
"Clinical-Genetic Research" Registries mainly pursue aims related to clinical and genetic research. They are mostly population-based, but show a tendency to be structured as case-based. They usually collect data on one disease or on groups of diseases, and they have national coverage with a tendency towards international coverage. They mainly collect clinical and genetic data, family history and patient's preferences for communication. Their main data providers are clinical and genetic units, patients and their families, and patients' organisations. Their main expectations of the Platform are focused on IT tools and Model documents.

"Treatment" Registries mainly pursue aims related to Treatment Evaluation and Treatment Monitoring. They have a higher tendency to be hospital-based and focused on one disease, and their geographical coverage is usually national. They collect different types of data, mainly Clinical data, Clinic research participation and biospecimen donation, Anthropometric info, Birth and reproductive history, Family history, Medications, devices and health services, and Patient-reported outcomes. Their most important data providers are Clinical Units and Centres of Expertise, and their major expectations of the Platform concern IT tools, with a relevant interest in Facilitated access to data sources and Expert technical advice.

The statistical analysis made it possible to explore the complex and fragmented landscape of rare disease Registries, highlighting structures with differentiated and interrelated profiles. These results are a useful source of information for targeted planning that can facilitate the interoperability and interconnection of Registries in accordance with the different profiles identified. This appears fundamental to the construction of the platform for rare disease registries.

E. References

1. Anderberg M.R. (1973), Cluster Analysis for applications, New York: Academic Press, Inc.
2. EPIRARE Work Package 6 (2012), Statistical Analysis of the EPIRARE survey database,
3. Interim Technical Report Appendix1
4. EPIRARE Work Package 8 (2012), Developing a European Platform for Rare Disease Registries, Draft 08/11/
5. EPIRARE (2012), EPIRARE Survey,
6. Everitt B.S. (1980), Cluster Analysis, 2nd Edition, London: Heineman Educational Books Ltd.
7. Fabbris L. (1983), Analisi esplorativa di dati multidimensionali, Cleup Editore.
8. Fabbris L. (1997), Statistica multivariata, Milano: McGraw-Hill Libri Italia.
9. Milligan G.W. and Cooper M.C. (1985), An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50,
10. SAS Institute Inc. (2009), SASOnlineDoc 9.2, Cary, NC: SAS Institute Inc.
11. SAS Institute Inc. (1999), SAS/STAT User's Guide, Version 8, Cary, NC: SAS Institute Inc.
12. Sarle W.S. (1983), Cubic Clustering Criterion, SAS Technical Report A-108, Cary, NC: SAS Institute Inc.