Assessing the validity of the PHQ-9, HADS, BDI-II and QIDS-SR16 in measuring severity of depression in a UK sample of primary care patients with a diagnosis of depression

Author: Applied Health Sciences (Mental Health), University of Aberdeen

© NHS Quality Improvement Scotland 2011

First published February 2011

This report was published by the sponsor, NHS Quality Improvement Scotland (NHS QIS), on behalf of the author, The University of Aberdeen. You can copy or reproduce the information in this document for use within NHSScotland and for educational purposes. You must not make a profit using information in this document. Commercial organisations must get our written permission before reproducing this document.

www.nhshealthquality.org

Grant holders
Professor Ian C Reid *
Dr Isobel M Cameron
Professor John R Crawford ***
Dr Kenneth Lawton * **

* Applied Health Sciences (Mental Health), University of Aberdeen, Royal Cornhill Hospital, Aberdeen, AB25 2ZH.
** School of Psychology, University of Aberdeen, Aberdeen, AB24 2UB
*** Centre of Academic Primary Care, University of Aberdeen, Foresterhill Health Centre, Westburn Road, Foresterhill, Aberdeen AB25 2AY

Research Team
Isobel M Cameron, Research Fellow
Amanda Cardy, Scottish Primary Care Research Network North Coordinator
John R Crawford, Professor of Psychology
Schalk Du Toit, Specialist Registrar in Psychiatry
Kenneth Lawton, Clinical Senior Lecturer
Steven Hay, Staff Grade Psychiatrist
Kenneth Mitchell, Consultant Psychiatrist
Ian C Reid, Professor of Mental Health
Sumit Sharma, Specialist Registrar in Psychiatry
Shilpa Shivaprasad, Specialist Registrar in Psychiatry
Sally Winning, Staff Grade Psychiatrist

Ethical Approval
This research was conducted with the approval of the North of Scotland Research Ethics Committee (reference number: 07/S0802/40). Additionally, it met the approval of the Research and Development Department of NHS Grampian and of the Scottish Primary Care Research Network.

Acknowledgements
We would like to thank the patients and staff of the nine practices in Grampian who kindly participated in this study and Kirsty Sykes for preparation of research materials.

Contents

1 Summary
   Background
   Why we did it
   Aims
   Methods
   Results
   Conclusion
   What it means for the service
2 Background
3 Aim
4 Methods/Design
   Study Design
   Setting
   Patients
   Depression Severity Measures
   Demographic data
   Procedure
   Statistical analyses
   Sample Calculation
5 Results
   Participating practices
   Patient participation
   Reliability Analysis
   Factor analysis
   Convergent and discriminant validity
   Convergence of severity banding
   Empirically derived cut offs
   Responsiveness to change
6 Discussion
   Summary of main findings
   Strengths and Weaknesses
   Comparisons with literature
7 Conclusion
8 Recommendations for NHS QIS
Appendix I Assessing the suitability of the QIDS-SR16 as a proxy for the HRSD-17
   Background
   Methods/Design
   Discussion
   Conclusion
Appendix II Assessing the inter-rater reliability of the clinician administered HRSD-17 (GRID version)
   Background
   Methods/Design
   Procedure
   Statistical analyses
   Results
   Discussion
   Conclusion
9 References

1 Summary

Background
The Integrated Care Pathway (ICP) for Depression and the Quality and Outcomes Framework (QOF) emphasise the importance of measuring severity of depression in primary care to target the condition with an appropriate intervention; however, there is an absence of psychometric comparison of endorsed measures (Patient Health Questionnaire (PHQ-9), Hospital Anxiety and Depression Scale (HADS) and Beck Depression Inventory (BDI-II)).

Why We Did It
For people presenting with depression in primary care, it is considered important that the severity of their depressive symptoms be measured in order to facilitate the offer of appropriate, evidence based interventions. This is reflected in the ICP for depression and in the QOF of the nGMS contract. However, there is an absence of objective psychometric comparison between endorsed measures to enable GPs to choose the optimal assessment tool for severity measurement.

Aims
To assess psychometric properties of the PHQ-9, HADS, BDI-II and Quick Inventory of Depressive Symptomatology (QIDS-SR16) relative to the clinician administered Hamilton Rating Scale for Depression (HRSD) in a sample of primary care patients with a depression diagnosis.

Methods
Patients (≥16 years) were recruited from nine general practices across Grampian (selected to yield participants with a mix of socio-economic and urban/rural status). Consenting participants completed the four severity measures and demographic questions, and were assessed by a psychiatrist with the 17-item HRSD (GRID-HAMD). Psychiatrists (n=6) were blind to the questionnaire responses. Order of administration of interview and booklet was randomly assigned, stratified by practice. Inter-rater reliability was assessed. A concurrent psychometric analysis was made of the four scales to assess: internal consistency, homogeneity, convergent and discriminant validity and responsiveness to change over time. Established severity cut-off scores for each scale were assessed for convergence with HRSD-17 cut offs. Receiver Operating Characteristic (ROC) curves were plotted to assess optimal severity cut off points. Analyses were conducted using SPSS (Version 17) and Clinimetrics Toolkit. Ethical approval was granted by the North of Scotland Research Ethics Committee.

Results
286 (25%) of 1134 invited patients participated: mean age=49.8 (s.d.=13.8), 69% female, mean HRSD-17=12.9 (s.d.=7.6). The HAD-D, PHQ-9, BDI-II and QIDS-SR16 all exhibited acceptable internal consistency (Cronbach's alpha 0.86-0.92) and robust factor structures (variance extracted by the first factor 46.3%-60.7%), indicating one underlying construct in each measure. Additionally, all the measures correlated highly with the HRSD-17 (r=0.68-0.78). The HAD-D, PHQ-9 and QIDS-SR16 exhibited discriminant validity but the BDI-II did not. All of the scales differed significantly in how they categorised the severity of depression relative to the HRSD-17 (Wilcoxon Signed Rank Test p<0.05). The HAD-D tended to categorise participants in milder categories than the HRSD-17, whereas the PHQ-9, BDI-II and QIDS-SR16 tended to categorise participants in a more severe category than the HRSD-17. Best sensitivity and specificity for detecting moderate severity of depressive symptoms was reached where HAD-D ≥9, PHQ-9 ≥12, BDI-II ≥23 and QIDS-SR16 ≥13. Based on diagnostic odds ratios, the QIDS-SR16 exhibited the greatest discriminatory performance (diagnostic odds ratio=21.42 (95% CI 11.03, 41.60)). All four questionnaires exhibited a similar magnitude of change over time.

Conclusion
Commonly used scales in UK primary care do not align adequately with the HRSD-17. Current severity cut-offs for such scales are therefore invalid for the selection of treatments determined by depression severity, given that HRSD assessments generally inform the evidence base. The present study provides empirically derived cut off points which can be used with these measures to facilitate the assessment of severity of depressive symptoms. However, health care professionals should not rely on this interpretation of a score alone when assessing an individual who may have depression but should also consider other factors including degree of impairment, length of episode, history of depression, family history, other co-morbid disorders and specific circumstances pertaining to individuals.

What it means for the service
Our findings show that commonly used scales in UK primary care for measuring the severity of depressive symptoms are invalid for selecting treatments which have been determined by depression severity. This study provides alternative cut offs that have been empirically derived from a comparison with a Hamilton Depression Rating Scale interview administered by a psychiatrist. These new cut offs improve the validity of the measures; however, these scales do not assess the severity of depressive symptoms with sufficient accuracy to be used with confidence in clinical practice. For this reason, the service should emphasise that health care professionals should not rely on the scores of such scales alone when assessing an individual who may have depression but should consider other factors including the degree of impairment, length of episode, history of depression, family history, other comorbid disorders and specific circumstances pertaining to individuals.

2 Background

The Scottish Integrated Care Pathway for Depression Toolkit Standard 33 (1) advocates the use of an objective measure for people with depression, stating that "A validated measure of depression is used at initial assessment and repeated at regular intervals to monitor progress and outcome." Similarly, the Quality and Outcomes Framework (QOF) of the new General Medical Services contract (nGMS) (2) provides incentives to general practices for making an assessment of severity at the outset of a new diagnosis of depression, with a further assessment occurring within five to 12 weeks. In the context of the 2009 Updated National Institute for Health and Clinical Excellence (NICE) Guidelines (3) on the management of depression in primary and secondary care, it would appear sensible to make such an assessment, as different treatment options are advocated according to severity. Additionally, applying an objective measure of severity, rather than relying on GPs' perceptions, is thought to be more reliable (4). Using such measures facilitates the monitoring of symptoms over time, allowing patients and clinicians to assess the impact of various interventions.

At the outset of a new episode of depression, general practitioners (GPs) are required to use one of the following QOF endorsed tools: the Patient Health Questionnaire (PHQ-9) (5); the Hospital Anxiety and Depression Scale (HADS) (6) Depression subscale (HAD-D); and the Beck Depression Inventory, Second Edition (BDI-II) (7). Practices are advised to choose one of these three measures. These measures are endorsed on the grounds of their validity. As their function in the context of the QOF is to measure severity, particular attention should be paid to how their severity cut-off bands were derived.

The established severity cut-off scores for the PHQ-9 are based on an assessment of US primary care patients (5). Whilst questions arise as to how well findings from a US sample can be generalised to the UK setting, the severity cut-offs were based on a substantial sample (n=580) recruited across several primary care sites. In the case of the BDI-II, the severity cut-off scores are based on a US sample of 127 patients, all from a University of Pennsylvania site (8). The original banding of non-case, doubtful case and case of the HADS was based on a sample of 100. The originators offered no empirical evidence to support endorsing a cut off of 15 and over as indicating a severe case (9). Although the HADS has been extensively researched, its validity has tended to be assessed in terms of its case finding properties, rather than on how it differentiates between severities of depression (10-12).

Presently, there is an absence of objective psychometric comparison between the endorsed measures that would enable GPs to choose the optimal assessment tool for severity measurement. Furthermore, the severity-related interventions advocated in the NICE Guidelines tend to be based on trials where severity has been assessed using the Hamilton Rating Scale for Depression (HRSD) (13).

Two studies in Germany have compared the validity of sets of instruments that included QOF depression severity scales (14,15). One compared the HADS, PHQ-9 and the World Health Organisation Well Being Index (WHO-5) against the DSM-III-R overview for the Structured Clinical Interview (SCID) (16); the other compared the PHQ-9, General Health Questionnaire (GHQ-12) and WHO-5 against the SCID. However, these comparisons were of their validity as case finding measures. These studies did not examine the validity of the severity banding. A recent German study compared the PHQ-9 with the HRSD-17 in terms of severity measurement (17). The PHQ-9 accounted for just 27% of the variance in HRSD-17 scores. However, this analysis was conducted on a small sample (n=47) and included HRSD-17 and PHQ-9 data that were administered up to two weeks apart (mean = 6.7 days); this is a wide time interval over which to assess concurrence of instruments.

A study which described the first UK validation of the PHQ-9 (18) assessed the PHQ-9 and the Clinical Outcomes in Routine Evaluation Outcome Measure (CORE-OM) as diagnostic tools. Here the focus was on case finding properties in a relatively small sample (n=97). Additionally, in recognising this was a small study, the authors concluded that replication of their findings would be helpful.

In the absence of UK validation of the severity banding of the QOF depression measures, GPs may be left to make their choice of measure based on convenience factors, such as licence-cost considerations or ease of scoring. These considerations are important but should always be preceded by considerations of psychometric robustness. Psychometric investigations of the PHQ-9 and HAD-D have found that the scales differ significantly in how they categorise the severity of depression in UK (19), Swedish (20) and Australian (21) samples. These findings suggest that at least one of these measures is categorising the severity of depression erroneously. Similarly, severity of depressive symptoms varied in accordance with choice of screening tool in a study of the QOF measures in an English primary care setting, where practices that used the PHQ-9 recorded greater prevalence of moderate and severe depressive symptoms compared with practices that used the HAD-D (22).

In considering the ICP standard 33 and the QOF depression severity measures, the Quick Inventory of Depressive Symptomatology (Self Report) (QIDS-SR16) (23) is also a scale worthy of investigation. The QIDS-SR16 is a measure of depression severity, devised in the US, which has been assessed and developed to act as a proxy measure for the HRSD. This link with the HRSD makes its potential for use in a UK primary care context of particular interest, given the use of the HRSD in studies assessing treatment effectiveness (3).

3 Aim

The principal aims of this study were as follows:

1. To assess the psychometric properties of the PHQ-9, HAD-D, and BDI-II in terms of their use as measures of severity of depression in a sample of primary care patients with a diagnosis of depression.
2. To establish empirically derived severity cut-off scores for these measures as they relate to the HRSD-17.

In meeting the above aims the following objectives were required to be met:

a) To assess the suitability of the QIDS-SR16 as a proxy for the HRSD-17 following the conversion tables of Rush et al., (www.ids-qids.org).
b) To assess the inter-rater reliability of the clinician administered HRSD-17 (GRID version).

The investigations relating to objective a) are reported in Appendix I and those relating to objective b) are reported in Appendix II. These secondary objectives had implications for the main study in that their results led us to conclude that the QIDS-SR16 (using the conversion tables of Rush et al., 2003) was not suitable to act as a proxy for the HRSD-17 and that the HRSD-17 (GRID version) demonstrated adequate inter-rater reliability. Although the QIDS-SR16 would not be used as a proxy for the HRSD-17, it remained in the study and we assessed its psychometric properties along with the other self-complete measures.

4 Methods/Design

Study Design
This study is a psychometric assessment of four depression severity rating scales: PHQ-9, HAD-D, BDI-II and QIDS-SR16. The aim was to assess the relative reliability, validity and responsiveness to change of these measures and to empirically derive severity bands which can be applied to a UK primary care population.

Setting
Patients were recruited from nine general practices across Grampian, Scotland. Practices were selected with the intention of yielding participants with a mix of socio-economic and urban/rural status. Socio-economic status was based on the deprivation status of practices as derived from Scottish Index of Multiple Deprivation (SIMD) scores. The urban/rural status of practice populations was assessed by the Scottish Executive 2007/8 urban/rural classification, whereby practices were coded according to their modal urban/rural category based on the number of patients in each category per practice. Practice populations by deprivation status and by urban/rural status were accessed at: www.isdscotland.org/isd/3793.html.

The HRSD-17 was administered to patient participants within their general practice or within a local community based hospital. The questionnaire booklet, containing the four self-complete measures and demographic questions, was completed by patient participants within their own space, i.e. this could be done at home or within another location of the participant's choice.

Patients

Inclusion criteria
Patients were eligible for inclusion if they were aged 16 years and over and their GP had diagnosed them as depressed. This reflects current QOF arrangements where GPs use their clinical judgement to identify depression. The current study included existing depressed patients in order to assess the scales across the spectrum of stages presenting in primary care. Both the Depression ICP toolkit Standard 33 and the QOF Depression 3 indicator advocate repetition of measurement; the inclusion of existing depressed patients was therefore relevant, as it allowed the scales to be tested across a greater diversity of presentations.

Exclusion criteria
The vast majority of depressed patients were eligible to consider participating; however, patients were excluded if they did not have the necessary spoken or written language skills.

Depression Severity Measures
The depression severity measures to be assessed comprised the HADS, PHQ-9, BDI-II and the QIDS-SR16 against the HRSD (17 item).

Hospital Anxiety and Depression Scale (HADS)
The HADS consists of 14 items, each rated from 0 to 3 according to severity of difficulty experienced. Eight items require reversed scoring, after which a depression (HAD-D) and an anxiety (HAD-A) subscale total can be summed. Each subscale score can range from 0 to 21. The scores can then be interpreted as indicating either mild, moderate or severe difficulty. Licence agreement was obtained through NFER Nelson.

Patient Health Questionnaire 9 (PHQ-9)
The PHQ-9 consists of 9 questions designed to correspond to the nine diagnostic criteria for major depressive disorder covered in the Diagnostic and Statistical Manual for Mental Disorders (DSM-IV) (24). Items are rated from 0 to 3 according to increased frequency of experiencing difficulties in each area covered. Scores are summed and can range from 0 to 27. The score can then be interpreted as indicating either no depression, minimal, mild, moderate, moderately severe or severe depression. The scale is freely available from http://www.depressionprimarycare.org/clinicians/toolkits/materials/forms/phq9/.

Beck Depression Inventory-II
The BDI-II consists of 21 items, each rated from 0 to 3 according to severity of difficulty experienced. Scores are summed and can range from 0 to 63. The score can then be interpreted as being in the minimal, mild, moderate or severe range. Licence agreement was obtained through Harcourt Assessment, Inc.

Quick Inventory of Depressive Symptomatology (Self-report) QIDS-SR16
The QIDS-SR16 is a brief self-complete measure designed to measure the severity of depression. It assesses the nine diagnostic criteria for major depressive disorders covered in DSM-IV and consists of 16 items, each ranging from 0 to 3. It is scored by summing the highest response in each of a set of questions relating to sleep, weight and psychomotor symptoms and then adding the remaining items. Scores can range from 0 to 27. Conversion tables have been generated which allow HRSD scores to be predicted from this measure; however, this has not previously been assessed in a UK primary care context. The QIDS-SR16 is freely available from http://www.ids-qids.org/.

Hamilton Rating Scale for Depression (17 item) HRSD-17
The 17-item HRSD structured interview is intended for use as the gold standard of depression severity measurement. The HRSD-17 has been chosen for this purpose as it has been widely used in intervention studies which have taken into account depression severity (3). Overwhelmingly, what is known about treatment efficacy has been based on trials which have assessed depression severity using this method. Additionally, the appropriateness of using the HRSD-17 in a primary care context has long been established (25). Initially the standard HRSD-17 schedule was used; however, following discussions at a UK primary care psychiatry meeting, the GRID-HAMD was brought to our attention and was used thereafter. It is freely available from http://www.iscdd.org/. Both administration and scoring are standardised in this method, which helps maximise inter-rater reliability without altering the original intent of the measure. However, to ensure inter-rater reliability between the psychiatrist interviewers, an additional analysis was conducted and is reported in Appendix II.
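The QIDS-SR16 scoring rule described above (take the highest response within the sleep, weight and psychomotor groups, then add the remaining items) can be illustrated with a minimal Python sketch. The item-to-group mapping used here (items 1-4 sleep, 6-9 appetite/weight, 15-16 psychomotor) is an assumption that is not stated in this report and should be checked against the official form at http://www.ids-qids.org/; the function name and the convention of coding unanswered items as 0 are likewise hypothetical.

```python
# Minimal sketch of QIDS-SR16 total scoring.
# ASSUMPTION: the item numbers in each group below are not given in this
# report; verify them against the official QIDS-SR16 form before use.
SLEEP_ITEMS = (1, 2, 3, 4)
WEIGHT_ITEMS = (6, 7, 8, 9)
PSYCHOMOTOR_ITEMS = (15, 16)
SINGLE_ITEMS = (5, 10, 11, 12, 13, 14)


def score_qids_sr16(responses):
    """Return the total score (0 to 27) from a dict of item number -> 0..3.

    Hypothetical convention: items the respondent skipped are coded as 0.
    """
    for item in range(1, 17):
        value = responses.get(item)
        if value not in (0, 1, 2, 3):
            raise ValueError(f"item {item} must be scored 0 to 3, got {value!r}")

    # Highest response within each multi-item group counts once.
    grouped = sum(
        max(responses[i] for i in group)
        for group in (SLEEP_ITEMS, WEIGHT_ITEMS, PSYCHOMOTOR_ITEMS)
    )
    # Remaining items are simply added.
    singles = sum(responses[i] for i in SINGLE_ITEMS)
    return grouped + singles


if __name__ == "__main__":
    example = {i: 0 for i in range(1, 17)}
    example.update({2: 2, 3: 1, 5: 3, 15: 1})
    print(score_qids_sr16(example))
```

Three grouped domains plus six single items give nine domains scored 0 to 3, which is consistent with the stated range of 0 to 27.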

Demographic data
Within the questionnaire booklet, demographic questions were included to allow an assessment of the sample's representativeness in terms of age, sex, ethnicity and deprivation status based on the Scottish Index of Multiple Deprivation (SIMD) as derived from participants' home postal codes.

Procedure
All recruits were invited to complete the four questionnaires on two occasions, three months apart. Additionally, they were invited to participate in a clinical interview conducted by a consultant psychiatrist, a specialist registrar or a staff grade psychiatrist. Prospective participants were given the alternative of taking part in a telephone interview if attending would be difficult. In such cases the GRID-HAMD was still used; however, as two of the items required visual observations (regarding retardation and agitation), raters were given instructions for the telephone version of the HRSD (26) in order to facilitate the assessment of these components. The interviewers were blind to the questionnaire responses.

Once recruitment had commenced (n=18), and following feedback from a conference presentation, randomisation of the order of administration of interview and booklet, stratified by practice, was introduced. This was done to reduce any confounding by order of completion. Participants randomised to receive the questionnaire first were encouraged to complete the booklet on the same day (before the interview), or the day before the interview. Participants randomised to receive the questionnaire following the interview were encouraged to complete the booklet on the same day (after the interview), or the following day. The process is outlined in Figure 1.

Figure 1: Recruitment and participation

Invitation to participate (n=1134): GP gives/sends, to patients with a diagnosis of depression, information sheet, reply slip and freepost envelope.

Patient considers information in his/her own time, phoning research team to discuss any queries as necessary (n=306).

Patient declines (n=20): No further action.

Patient accepts (286): Patient contacts University research team (by reply slip or telephone) indicating interest and is randomised to receive questionnaire either before or after the interview. They are allocated an appointment with a study doctor (to take place at practice, community hospital or by telephone).

Receives questionnaire before (n randomised=137; n not randomised=8): Three days prior to appointment researcher posts to patient: appointment confirmation letter, questionnaire, consent form and reply paid envelope.
Appointment for HRSD-17 (n=141): Study doctor confirms patient has completed consent form and questionnaire and conducts HRSD-17 assessment.

Receives questionnaire after (n randomised=131; n not randomised=10): Researchers post to patient: appointment confirmation letter.
Appointment for HRSD-17 (n=140): Study doctor completes consent form with participant and conducts HRSD-17 assessment. The participant is given the questionnaire and reply paid envelope to take away and complete.

Follow up (282 sent out): Patients sent three month follow up questionnaire. Up to two reminder letters were sent to non-responders.
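The allocation step in Figure 1 (order of questionnaire and interview randomised, stratified by practice) could be implemented in several ways, and the report does not say which was used. The sketch below is one common approach, permuted blocks within each practice, offered purely as an illustration; the class name, arm labels, block size and seed are all hypothetical and do not describe the study's actual procedure.

```python
import random
from collections import defaultdict

# Hypothetical arm labels for the two orders of administration.
ARMS = ("questionnaire_first", "interview_first")


class StratifiedAllocator:
    """Permuted-block allocation carried out separately within each practice.

    ASSUMPTION: block randomisation is used here only as an example; the
    report states that randomisation was stratified by practice but does
    not describe the mechanism.
    """

    def __init__(self, block_size=2, seed=None):
        if block_size <= 0 or block_size % len(ARMS) != 0:
            raise ValueError("block_size must be a positive multiple of the number of arms")
        self._block_size = block_size
        self._rng = random.Random(seed)
        self._pending = defaultdict(list)  # practice -> remaining slots in current block

    def allocate(self, practice):
        """Return the arm for the next participant recruited at `practice`."""
        queue = self._pending[practice]
        if not queue:
            # Start a new balanced block for this practice and shuffle it.
            block = list(ARMS) * (self._block_size // len(ARMS))
            self._rng.shuffle(block)
            queue.extend(block)
        return queue.pop()


if __name__ == "__main__":
    allocator = StratifiedAllocator(block_size=4, seed=1)
    for practice in ["Practice A", "Practice A", "Practice B", "Practice A"]:
        print(practice, allocator.allocate(practice))
```

Blocking within each practice keeps the two orders of administration balanced inside every stratum, which is the property that stratification is intended to provide.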

Statistical analyses

A concurrent analysis was made of the self-complete scales. Internal consistency was assessed using Cronbach's alpha and item-total correlations. The factor structure was examined to assess the homogeneity of each scale by observing the amount of variance extracted by the first factor, using principal components factor analysis. The robustness of the emerging factors in each scale was analysed by computing coefficients of congruence (27) between the samples at the first and second time points.

Convergent validity was examined by computing correlations of each of the self-complete scales with the HRSD-17. Discriminant validity was investigated by correlating the scales with the HAD Anxiety subscale (HAD-A) to assess whether they exhibited greater convergence with the HRSD-17 than with the HAD-A. Both convergence and discrimination were assessed using Williams' test (28).

Convergence of the scales' severity bandings was also investigated. The established severity cut-off scores for each scale were assessed for convergence using the Wilcoxon Signed Rank Test for related samples. Severity cut-off bands for moderate depressive symptoms were assessed relative to HRSD-17 ≥14 using Receiver Operating Characteristic (ROC) curves (29). Sensitivity and specificity of the scales at detecting symptoms of moderate severity were calculated with accompanying confidence intervals (30). Positive and negative predictive values (PPVs and NPVs) were also calculated, as were Diagnostic Odds Ratios (DORs) with accompanying confidence intervals (Glas et al., 2003). DORs allow assessment of each scale's discriminatory performance using a single indicator relative to HRSD-17 defined moderate severity.

Responsiveness to clinical change across the two time points was measured by running paired t-tests on the scales; the effect size of each scale was then calculated (31). Analyses were conducted using SPSS (Version 17) and Clinimetrics Toolkit.

Sample Calculation

The original intention was to recruit a sample of 500. However, this was based on the assumption that the QIDS-SR16 would be used as a proxy for the HRSD-17 once its fitness for purpose was established. As this was not established, it was not considered feasible to recruit 500 participants to have the clinical interview. In light of the continued need for clinical interviews, the sample size was revised to n>250. Larger samples are required to detect small differences between groups. Where a sample is too small there is a risk of a type II error, that is, of failing to find a significant difference between samples when a difference genuinely exists. In the current study, preliminary analysis revealed marked differences in depression severity categorisation; therefore, even with a sample size of half the original intent, the power remained adequate.

5 Results

Participating practices

Practice characteristics are presented in Table 1. The nine practices represent a range of urban and rural settings. In terms of the predominant deprivation status of their catchments, four of the quintiles for Scotland were represented. No practices in Grampian fall into the fifth quintile for Scotland, representing greatest deprivation.

Table 1 Characteristics of participating practices

| Practice | SIMD quintile (1), 2007 | Modal urban/rural category (2) |
|---|---|---|
| A | 1st | Other urban area: settlements of 10,000 to 125,000 people |
| B | 4th | Large urban area: settlements of over 125,000 people |
| C | 1st | Accessible small town: settlements of 3,000 to 10,000 people and within 30 minutes drive of settlements of 10,000 or more |
| D | 4th | Large urban area: settlements of over 125,000 people |
| E | 4th | Large urban area: settlements of over 125,000 people |
| F | 3rd | Other urban area: settlements of 10,000 to 125,000 people |
| G | 2nd | Remote small town: settlements of 3,000 to 10,000 people within 30-60 minutes drive to settlements of 10,000 or more |
| H | 3rd | Remote small town: settlements of 3,000 to 10,000 people within 30-60 minutes drive to settlements of 10,000 or more |
| I | 4th | Large urban area: settlements of over 125,000 people |

(1) Scottish Index of Multiple Deprivation quintile for Scotland, where 1st represents least deprived and 5th most deprived (mapped to 2006 reference files). (2) Urban/rural practice status according to Scottish Government 2007/2008 urban rural classification.

Patient participation

Two hundred and eighty-six (25%) of 1134 invited patients participated. Table 2 presents demographic characteristics of participants.

Table 2 Demographic characteristics of patient participants

| Characteristic | Summary statistic | N |
|---|---|---|
| Age, mean (standard deviation) | 49.8 years (13.8) | 270 |
| Female, n (%) | 197 (69) | 286 |
| Ethnicity, n (%) | | 270 |
| Asian | 1 | |
| Black | 1 | |
| Other | 1 | |
| White | 267 (99) | |
| First language English, n (%) | 261 (98) | 266 |
| Main activity, n (%) | | 266 |
| Employed/self employed | 136 (51) | |
| Housework | 45 (17) | |
| Seeking work | 10 (4) | |
| Retired | 48 (18) | |
| Student | 4 (2) | |
| Other | 23 (8) | |
| Educated beyond minimum school age, n (%) | 138 (52) | 266 |
| Educated to degree level, n (%) | 95 (36) | 266 |

Of the 286 participants, 269 (94%) both completed the questionnaire booklet and participated in the interview (Table 3). The remainder did only one of the two: some interview participants did not subsequently complete and return the questionnaire, and some participants indicated they did not have time to take part in an interview but were willing to complete the questionnaire. Such participation was considered of value to the study as these data could be included in the analyses which did not require the HRSD-17, for example analyses assessing internal consistency and factor structure of the questionnaires.

The first 19 (7%) participants were assessed with the standard version of the HRSD-17. Following this, 263 (93%) participants were assessed with the GRID version. Two hundred and forty (85%) interviews were conducted face-to-face and 42 (15%) by telephone.

Table 3 Participation in interview and questionnaire

| | Returned questionnaire (%) | Did not return questionnaire (%) |
|---|---|---|
| Participated in HRSD-17 (%) | 269 (94) | 13 (4) |
| Did not participate in HRSD-17 (%) | 4 (1) | - |

Three month follow up questionnaires were completed by 214 (76%) participants.

The distribution of the HRSD-17 total scores did not deviate from a Normal distribution, as confirmed by the one-sample Kolmogorov-Smirnov test of goodness of fit: D=0.064, p=0.199. The mean score was 12.9 (s.d. 7.57). This corresponds with the upper end of the mild category as defined by the Handbook of Psychiatric Measures (32). Table 4 presents the HRSD-17 scores according to the established categories.

Table 4 HRSD-17 severity categories (Handbook of Psychiatric Measures)

| HRSD-17 severity category | N (%) |
|---|---|
| None (0-7) | 76 (27.0) |
| Mild (8-13) | 71 (25.3) |
| Moderate (14-18) | 66 (23.5) |
| Severe (≥19) | 68 (24.2) |

Total scores of the self-complete questionnaires were also found not to deviate significantly from a Normal distribution according to the Kolmogorov-Smirnov test at the first time point (HAD-D D=0.06, p=0.31; PHQ-9 D=0.09, p=0.05; BDI-II D=0.07, p=0.18; QIDS-SR16 D=0.07, p=0.18). At the second time point HAD-D and BDI-II did not differ significantly from a Normal distribution (HAD-D D=0.09, p=0.08; BDI-II D=0.09, p=0.07); however, PHQ-9 and QIDS-SR16 did differ (PHQ-9 D=0.11, p=0.01; QIDS-SR16 D=0.11, p=0.02). Summary statistics of the self-complete questionnaires are presented in Table 5.

Table 5 Scores of self-complete depression measures (initial time point)

| Self-complete questionnaire | Mean (s.d.) | Median (Quartiles) | N* |
|---|---|---|---|
| Time point 1 | | | |
| HAD-D | 8.39 (4.73) | 8 (5, 12) | 270 |
| PHQ-9 | 11.8 (7.25) | 11 (6, 18) | 252 |
| BDI-II | 23.7 (12.8) | 22 (14, 33) | 241 |
| QIDS-SR | 12.1 (5.73) | 12 (8, 16.5) | 265 |
| Time point 2 | | | |
| HAD-D | 7.25 (4.73) | 7 (3.5, 10) | 209 |
| PHQ-9 | 10.6 (7.24) | 10 (4.75, 16) | 202 |
| BDI-II | 19.9 (14.0) | 17 (8, 28) | 198 |
| QIDS-SR | 10.8 (6.08) | 10 (6, 16) | 207 |

*Where data were complete and total scores could be summed.

Reliability Analysis

Cronbach's alphas for each of the self-complete measures at both time points are presented in Table 6. All scales exhibited acceptable internal consistency at both time points. Item-total correlations were also satisfactory for all the questionnaires at both time points, with the exception of one item on the QIDS-SR16 at time point 1, which fell just below 0.4 (0.39). This was the item derived from the highest of the four weight questions.

Table 6 Cronbach's alpha and item-total correlations of self-complete questionnaires at time point 1 and time point 2

| Questionnaire | Time point 1: Cronbach's α (95% C.I.) | Time point 1: N | Time point 1: item-total correlations (minimum, maximum) | Time point 2: Cronbach's α (95% C.I.) | Time point 2: N | Time point 2: item-total correlations (minimum, maximum) |
|---|---|---|---|---|---|---|
| HAD-D | 0.87 (0.84, 0.89) | 269 | 0.54, 0.74 | 0.89 (0.86, 0.91) | 209 | 0.54, 0.79 |
| PHQ-9 | 0.92 (0.90, 0.93) | 251 | 0.60, 0.80 | 0.92 (0.90, 0.94) | 202 | 0.64, 0.77 |
| BDI-II | 0.94 (0.93, 0.95) | 239 | 0.47, 0.78 | 0.96 (0.95, 0.96) | 198 | 0.51, 0.81 |
| QIDS-SR | 0.86 (0.83, 0.88) | 264 | 0.39, 0.77 | 0.89 (0.86, 0.91) | 207 | 0.47, 0.74 |
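The reliability statistics in Table 6 can be illustrated with a minimal sketch. The study itself used SPSS; the code below is only an illustration, the response data are hypothetical, and the use of the corrected item-total correlation (each item against the sum of the remaining items) is an assumption, since the report does not say which variant was computed.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(rows):
    """Correlate each item with the sum of the other items (assumed variant)."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [row[j] for row in rows]
        rest = [sum(row) - row[j] for row in rows]
        result.append(pearson(item, rest))
    return result


# Hypothetical responses: 5 respondents x 4 items, each scored 0-3.
responses = [
    [0, 1, 0, 1],
    [1, 1, 2, 1],
    [2, 2, 1, 2],
    [3, 2, 3, 3],
    [1, 0, 1, 0],
]
print(round(cronbach_alpha(responses), 2))
print([round(r, 2) for r in corrected_item_total(responses)])
```

An item whose corrected correlation falls below a chosen threshold (0.4 is the value mentioned in the text) would be flagged for closer inspection.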

Factor analysis

Principal components analysis revealed a uni-dimensional factor structure in the HAD-D, PHQ-9 and QIDS-SR16 at both time points. For the BDI-II, three factors emerged at time point one and two factors at time point two; however, the scree plots suggested the presence of one underlying factor at both time points. Figure 2 presents the first time point scree plot to illustrate this.

Figure 2 Scree plot of BDI-II at first time point showing three factors with eigenvalues greater than one. One underlying factor is apparent.

The item variance accounted for by the single factor emerging in the HAD-D, PHQ-9 and QIDS-SR16, and by the first factor of the BDI-II, is shown in Table 7. At both time points, the items within each scale had substantial loadings, indicating that they are factorially valid. For each scale, the coefficients of congruence between time point one and time point two tended towards one. This indicates that each scale has a highly robust uni-dimensional factor structure regardless of the time point at which it was measured.

Table 7 Item variance accounted for by first factor, range of item loadings and coefficients of congruence

| Questionnaire | Percentage variance measured by 1st factor, time point 1 | Range of item loadings on 1st factor, time point 1 | Percentage variance measured by 1st factor, time point 2 | Range of item loadings on 1st factor, time point 2 | Coefficient of congruence (time point 1 to time point 2) |
|---|---|---|---|---|---|
| HAD-D | 56.7 | 0.652-0.834 | 61.1 | 0.648-0.868 | >0.999 |
| PHQ-9 | 60.7 | 0.682-0.857 | 61.4 | 0.714-0.822 | 0.999 |
| BDI-II | 46.3 | 0.513-0.819 | 54.8 | 0.552-0.842 | 0.998 |
| QIDS-SR | 48.8 | 0.481-0.844 | 53.2 | 0.553-0.818 | 0.998 |
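As an illustration of how a coefficient of congruence between two sets of factor loadings might be computed, the sketch below uses Tucker's congruence coefficient. That this is the coefficient described in reference (27) is an assumption, and the loadings shown are hypothetical values, not the study's item loadings.

```python
from math import sqrt


def congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two loading vectors."""
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = sum(a * a for a in loadings_a)
    norm_b = sum(b * b for b in loadings_b)
    return cross / sqrt(norm_a * norm_b)


# Hypothetical first-factor loadings for a 5-item scale at two time points.
time_point_1 = [0.65, 0.72, 0.80, 0.83, 0.70]
time_point_2 = [0.66, 0.75, 0.79, 0.86, 0.71]
print(round(congruence(time_point_1, time_point_2), 4))
```

A value close to one indicates that the pattern of loadings is nearly proportional across the two occasions, which is the sense in which the factor structure is described as robust.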

Convergent and discriminant validity

For assessment of convergent and discriminant validity, data were only included where the HRSD-17 and the self-complete measures were completed within three days of one another. It was considered that with a maximum time difference of three days, there would be sufficient overlap in reference points. The sample included in this analysis was therefore n=252. In fact 233 (92%) of these observations were within one day of each other, giving substantial overlap of points of reference.

Figure 3 Scatter plots of the HAD-D, PHQ-9, BDI-II and QIDS-SR16 against HRSD-17

The questionnaires demonstrated good convergent validity in that they all correlated highly with HRSD-17 (HAD-D and HRSD-17 r=0.68; PHQ-9 and HRSD-17 r=0.78; BDI-II and HRSD-17 r=0.74; and QIDS-SR16 and HRSD-17 r=0.78). Figure 3 demonstrates the convergence in scatter plots, where it can be seen that the HRSD-17 forms a linear relationship with all the measures. For HAD-D and HRSD-17, r²=0.47, that is to say the HAD-D sum score accounted for 47% of the variance in the HRSD-17 scores. The PHQ-9, BDI-II and QIDS-SR16 accounted for 59%, 54% and 61% of the variance in HRSD-17 scores respectively.

The measures also correlated highly with the HAD-A. This is to be expected between such closely related constructs as anxiety and depression; however, apart from the BDI-II, correlations were significantly higher between the self-complete measures and HRSD-17 than with the HAD-A (Table 8). Therefore, all but the BDI-II demonstrated discriminant validity.

Table 8 Correlation of the depression severity self-complete questionnaires with the HRSD-17 and with the HAD-A

| Questionnaire | Correlation with HRSD-17 | Correlation with HAD-A | Williams' test p-value |
|---|---|---|---|
| HAD-D | 0.697 | 0.593 | 0.012 |
| PHQ-9 | 0.784 | 0.656 | 0.001 |
| BDI-II | 0.748 | 0.685 | 0.114 |
| QIDS-SR | 0.791 | 0.684 | 0.003 |

Convergence of severity banding

The distribution of the measures across categories is shown in the bar chart in Figure 4. For the purpose of this observation, data were only included where the HRSD-17 and the self-complete measures were completed within three days of one another. Additionally, data were only included where they were complete and could be summed for all measures, so that exactly the same sample of participants was being assessed (n=204). It can be seen that the HAD-D categorises a larger proportion of participants in the none category relative to the other measures and the PHQ-9 tends to categorise more participants in the moderate category. Although this bar chart provides an opportunity to assess the relative distribution of categorisation of depression severity, it is limited in that it does not inform with regard to agreement in categorisation.

Figure 4 Distribution of scores across categories according to each measure (n=204). [Bar chart: None, Mild, Moderate and Severe categories for HRSD-17, HAD-D, PHQ-9, BDI-II and QIDS-SR16.]

Figure 5, however, shows the convergence in severity banding between each scale and the HRSD-17. The HAD-D tended to categorise participants in a milder category than the HRSD-17, whereas the PHQ-9, QIDS-SR16 and BDI-II tended to categorise participants in a more severe category. The Wilcoxon Signed Rank test for related samples showed these differences to be significant for each measure relative to the HRSD-17 (p<0.05).

Figure 5 Convergence of severity banding. [Bar chart: number of participants categorised below (<HRSD-17), the same as (=HRSD-17) or above (>HRSD-17) their HRSD-17 category, for HAD-D (n=247), PHQ-9 (n=233), BDI-II (n=222) and QIDS-SR (n=242).]

Empirically derived cut offs

As the NICE Guidelines on the management of depression advocate the use of antidepressants where depression is of at least moderate severity, the optimal cut-off points in the self-complete measures were assessed against HRSD-17 ≥14. Assessment with Receiver Operator Characteristic (ROC) curves allows identification of the best trade-off between sensitivity and specificity.

Figure 6 ROC curves of self-complete depression severity measures relative to HRSD-17 cut off of 14

Table 9 presents the area under the ROC curve (AUROC) of each self-complete depression measure. The greater the area, the better the test is at discriminating.

For example, an area of 1 would indicate the self-complete measure was perfect at discriminating between those above and below the HRSD-17 ≥14 threshold. An area of 0.5 would indicate the test did no better than chance at discriminating between those above and below the threshold. All four measures were shown to perform significantly better than chance at discriminating between those above and below the threshold (p<0.001). The largest area under the curve occurred with the QIDS-SR16.

Table 9 Area under the Receiver Operator Characteristics (AUROC) curve of self-complete depression severity measures relative to a HRSD-17 ≥14 (moderate) cut off

| Questionnaire | N* | AUROC curve | 95% Confidence Intervals |
|---|---|---|---|
| HAD-D | 247 | 0.834 | 0.785-0.882 |
| PHQ-9 | 233 | 0.879 | 0.837-0.921 |
| BDI-II | 222 | 0.850 | 0.802-0.899 |
| QIDS-SR16 | 242 | 0.894 | 0.856-0.933 |

*Where data were complete for HRSD-17 and the self-complete questionnaire and where completion occurred within a 3 day interval.

When deciding upon a cut-off point on a scale, there is always a trade-off between specificity and sensitivity, i.e. optimising sensitivity to ensure no cases are missed will inevitably result in an increased rate of individuals being rated above the threshold who do not meet the criteria. Table 10 shows, for each self-complete depression severity measure, the discriminatory properties at the moderate cut off defined by the scale's developers relative to HRSD-17 ≥14. Following the ROC curve analysis, the optimal discriminatory properties are also shown. DORs provide a single indicator from which to assess a scale's discriminatory performance. The DOR is independent of prevalence, and depression guidelines (3) state that a diagnostic odds ratio ≥20 demonstrates sufficient accuracy for use in clinical settings. As such, the QIDS-SR16 appears superior to the other measures; however, the confidence intervals are wide.
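The interpretation of the AUROC given above (1 for perfect discrimination, 0.5 for chance) can be made concrete with a small sketch. It uses the rank-based equivalence between the area under the ROC curve and the probability that a randomly chosen participant at or above the HRSD-17 threshold scores higher on the questionnaire than a randomly chosen participant below it, with ties counted as one half. The scores are hypothetical, and this is not a reproduction of the SPSS procedure used in the study.

```python
def auroc(scores_above_threshold, scores_below_threshold):
    """Area under the ROC curve via pairwise comparison of the two groups."""
    wins = 0.0
    for above in scores_above_threshold:
        for below in scores_below_threshold:
            if above > below:
                wins += 1.0
            elif above == below:
                wins += 0.5  # ties contribute half a concordant pair
    return wins / (len(scores_above_threshold) * len(scores_below_threshold))


# Hypothetical questionnaire totals, grouped by HRSD-17 >= 14 versus < 14.
moderate_or_worse = [15, 18, 12, 21, 9]
below_moderate = [4, 7, 12, 3, 10]
print(auroc(moderate_or_worse, below_moderate))
```

The pairwise form is quadratic in the sample size, which is acceptable for samples of a few hundred participants such as those in Table 9.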

Table 10 Self-complete depression severity measures: discriminatory performance of detecting moderate depression severity relative to a HRSD-17 ≥14 (moderate) cut off

| Cut off | % Sensitivity (95% Confidence intervals) | % Specificity (95% Confidence intervals) | Diagnostic odds ratio (95% Confidence intervals) | Positive predictive value (%) | Negative predictive value (%) |
|---|---|---|---|---|---|
| Cut offs for moderate severity as defined by scale's developers relative to HRSD-17 ≥14 | | | | | |
| HAD-D ≥11 | 52 (43-61) | 89 (83-94) | 8.51 (4.44-16.30) | 80 | 68 |
| PHQ-9 ≥10 | 87 (80-93) | 69 (60-77) | 14.29 (7.27-28.09) | 70 | 86 |
| BDI-II ≥20 | 84 (78-91) | 68 (60-77) | 11.58 (6.03-22.20) | 72 | 82 |
| QIDS-SR ≥11 | 86 (79-92) | 68 (60-76) | 12.57 (6.60-23.95) | 70 | 85 |
| Optimal cut offs for moderate severity derived from ROC curve analysis relative to HRSD-17 ≥14 | | | | | |
| HAD-D ≥9 | 73 (65-81) | 76 (68-83) | 8.47 (4.78-15.02) | 72 | 76 |
| PHQ-9 ≥12 | 76 (68-84) | 78 (71-85) | 11.46 (6.20-21.17) | 74 | 80 |
| BDI-II ≥23 | 73 (65-82) | 74 (66-82) | 7.99 (4.39-14.54) | 73 | 74 |
| QIDS-SR ≥13 | 79 (71-86) | 85 (79-91) | 21.42 (11.03-41.60) | 82 | 82 |
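The quantities reported in Table 10 all derive from a 2x2 cross-classification of each questionnaire cut off against HRSD-17 ≥14. The sketch below shows the standard formulas, with the confidence interval for the diagnostic odds ratio computed on the log scale as described by Glas et al. (2003). The cell counts are hypothetical, since the report does not give the underlying 2x2 tables, and the function name is illustrative only.

```python
from math import exp, log, sqrt


def diagnostic_summary(tp, fn, fp, tn, z=1.96):
    """Sensitivity, specificity, PPV, NPV and DOR (with 95% CI) from 2x2 counts.

    tp: at/above questionnaire cut off and HRSD-17 >= 14
    fn: below questionnaire cut off but HRSD-17 >= 14
    fp: at/above questionnaire cut off but HRSD-17 < 14
    tn: below questionnaire cut off and HRSD-17 < 14
    """
    dor = (tp * tn) / (fp * fn)
    # Standard error of ln(DOR); requires all four cells to be non-zero.
    se_log_dor = sqrt(1 / tp + 1 / fn + 1 / fp + 1 / tn)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "dor": dor,
        "dor_ci": (exp(log(dor) - z * se_log_dor),
                   exp(log(dor) + z * se_log_dor)),
    }


# Hypothetical counts for one questionnaire cut off.
summary = diagnostic_summary(tp=40, fn=10, fp=10, tn=40)
for name, value in summary.items():
    print(name, value)
```

Because the interval is built on the log scale it is asymmetric around the DOR and widens quickly when any cell is small, which is consistent with the wide intervals noted in the text.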

Responsiveness to change

The three-month follow up questionnaire was completed by 214 (76%) participants. Time between completion one and two was not normally distributed (D=0.22; p<0.001). The median time between completion one and two was 98 days (IQR 93, 110). The HRSD-17 interviews were not conducted at the second time point; therefore, observations of the self-complete measures in terms of responsiveness to change are a relative comparison. Additionally, time point one did not represent the start of a depressive episode, nor did it signify the beginning of an intervention; therefore, this assessment is simply one of the relative responsiveness of the four measures over two points in time.

Changes in mean scores from time point one to time point two are shown in Table 11. For all of the measures, there is a statistically significant difference in the mean score from time point one to time point two, with all showing a reduction in scores. The effect size is small for all the measures, indicating that their measurement of magnitude of change over time is similar.

Table 11 Responsiveness to change over time of self-complete depression severity measures

| Questionnaire | Mean (s.d.) Time point 1 | Mean (s.d.) Time point 2 | 95% Confidence Intervals of difference of means | Effect size |
|---|---|---|---|---|
| HAD-D (n=202) | 8.09 (4.81) | 7.22 (4.78) | (0.37-1.37) | 0.18 |
| PHQ-9 (n=183) | 11.28 (7.27) | 10.39 (7.27) | (0.01-1.77) | 0.12 |
| BDI-II (n=180) | 22.76 (12.63) | 19.53 (13.90) | (1.85-4.61) | 0.26 |
| QIDS-SR16 (n=198) | 11.67 (5.64) | 10.72 (6.09) | (0.27-1.62) | 0.17 |
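The effect sizes in Table 11 are consistent with a standardised effect size computed as the change in mean score divided by the standard deviation at time point one. That this is the formula given in reference (31) is an assumption; the short sketch below simply checks it against the means and standard deviations reported in the table.

```python
def effect_size(mean_t1, sd_t1, mean_t2):
    """Change in mean score divided by the baseline standard deviation."""
    return (mean_t1 - mean_t2) / sd_t1


# (mean at time 1, s.d. at time 1, mean at time 2), taken from Table 11.
table_11 = {
    "HAD-D": (8.09, 4.81, 7.22),
    "PHQ-9": (11.28, 7.27, 10.39),
    "BDI-II": (22.76, 12.63, 19.53),
    "QIDS-SR16": (11.67, 5.64, 10.72),
}

for scale, (m1, sd1, m2) in table_11.items():
    print(scale, round(effect_size(m1, sd1, m2), 2))
```

Under this assumed formula the rounded results match the effect size column of Table 11 (0.18, 0.12, 0.26 and 0.17).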

6 Discussion

Summary of main findings

The HAD-D, PHQ-9, BDI-II and QIDS-SR16 all exhibited acceptable internal consistency and robust factor structures indicating one underlying construct in each measure. Additionally, all the measures correlated highly with the HRSD-17. The HAD-D, PHQ-9 and QIDS-SR16 exhibited discriminant validity but the BDI-II did not. All of the scales differed significantly in how they categorised the severity of depression relative to the HRSD-17. The HAD-D tended to categorise participants in milder categories than the HRSD-17, whereas the PHQ-9, BDI-II and QIDS-SR16 tended to categorise participants in a more severe category than the HRSD-17. All four questionnaires exhibited a similar measurement of magnitude of change over time.

An HRSD-17 score ≥14 is indicative of depression symptoms of moderate severity. This threshold is of interest to clinicians as the NICE Guidelines on treatment of depression (2009) advocate the use of antidepressant therapy for patients with symptoms of at least moderate severity. On this basis, the best sensitivity and specificity for detecting moderate severity of depressive symptoms are reached where HAD-D ≥9, PHQ-9 ≥12, BDI-II ≥23 or QIDS-SR16 ≥13. The diagnostic odds ratios indicated that the QIDS-SR16 operated at a greater degree of accuracy than the other measures; however, the confidence intervals were wide, with the lower limit falling outside the acceptable range for clinical practice.

Strengths and Weaknesses

To our knowledge, this is the first study in the UK to assess these four commonly used depression severity measures in terms of their ability to measure severity of depression. The HRSD-17 is not a diagnostic interview and some may argue that the self-complete measures ought to be assessed against a diagnostic interview such as the SCID. 
However, the purpose of these scales in UK treatment of depression is not for them to be used as case finding tools but to be used for assessing severity of depression in patients already diagnosed by a clinician. The assessment of severity is to identify appropriate evidence-based treatment options. As such, the HRSD-17, administered by a clinician trained in mood assessment, represents the best standard against which to assess the self-complete measures, as it is on the HRSD-17 that the evidence base with regard to severity is founded. Furthermore, we were able to establish inter-rater reliability between the clinicians that carried out the assessments.

Our strategy involved recruiting practices from a range of settings encompassing urban, rural and suburban locations and with a range of affluence/deprivation. Because we recruited entirely in the Grampian area, our sample does not include the most deprived of practices. Similarly, Grampian does not represent the ethnic diversity of some parts of Scotland. It is uncertain how this may impact on the generalisability of our findings.

Only a quarter of patients invited to participate did so. This is perhaps not surprising. Depressive symptoms relating to motivation and concentration, and co-occurring anxiety symptoms, may have contributed to dissuading potential recruits from agreeing to complete several questionnaires and meet with an unfamiliar

clinician. However, the priority for this sample, sought for psychometric assessment, was that it represented an even distribution of patients with differing severity of symptoms of depression. The sample met this criterion. Additionally, the sample was similar, in terms of sex, to patients consulting GPs regarding depression throughout Scotland in 2007/2008 (http://www.isdscotland.org/isd/3711.html).

Diagnostic odds ratios were used to assess the relative performance of each scale in identifying patients with moderately severe depressive symptoms. It is worth remembering that two scales with identical diagnostic odds ratios could have very different sensitivity and specificity; therefore, choice of scale should consider which of these criteria is most important for the given purpose.

Although we assessed how the self-complete questionnaires measured change over time, there was no HRSD-17 assessment at the second time point. We could not therefore comment on the accuracy of the questionnaires at measuring change over time but could only comment on the relative magnitude of change the questionnaires observed.

This study compared the depression self-complete measures against the HRSD-17. In addition to assessing the accuracy of these measures in categorising the severity of depressive symptoms, it would have been of great interest to assess GPs' accuracy in gauging the severity of their patients' depressive symptoms. Whether GPs' assessment of severity was any better or worse than the measures could then have been assessed. Adding this element to the study would have been difficult to achieve as it would have required synchronised timing of three components: psychiatric interview, GP consultation and completion of the questionnaire booklet. As it was, it was sometimes challenging to align times when patients and interviewing psychiatrists could meet, even without the additional element of a GP appointment at a close time point. 
Accuracy of GPs' assessment should however be a focus of future research. Lowe et al. (2004) found GPs' detection of major depression to be poor relative to measures including the PHQ-9 and HAD-D. However, this was in a sample of patients attending practices rather than an assessment of GPs' accuracy in gauging severity in patients already diagnosed with depression.

Comparisons with literature

Our earlier work raised concerns regarding the validity of the HAD-D and PHQ-9 in terms of their assessment of severity of depression (19). We observed that at least one of the two measures was erroneously categorising the severity of depression. Investigations in Sweden and Australia reached similar conclusions (20,21). The present psychometric assessment, in which the HRSD-17 interview assessment was conducted, enables us to conclude that both the HAD-D and the PHQ-9, used with their conventional scoring systems, categorise the severity of depression inaccurately. The HAD-D tends to place participants in a milder category of depression than the HRSD-17 and the PHQ-9 tends to place individuals in a more severe category. This latter tendency is also true of the BDI-II and QIDS-SR16.

Kendrick et al. (2009), in an observational study of the QOF depression severity measures in UK primary care, observed that practices which used the PHQ-9 tended to observe greater severity of symptoms than practices that used the HAD-D. Drawing on evidence of psychometric studies which had assessed the PHQ-9 (18)