Tariq Siddiqi; Muhammad Usman; Muhammad Khan; Muhammad Khan; Haris Riaz; Safi Khan; M. Murad; Clifford Kavinsky; Rami Doukky; Ankur Kalra; Milind Desai; Deepak Bhatt

doi:10.4244/EIJ-D-19-00636

Abstract

Aims: The aim of this study was to evaluate the performance of risk stratification models (RSMs) in predicting short-term mortality after transcatheter aortic valve replacement (TAVR).

Methods and results: MEDLINE and Scopus were queried to identify studies which validated RSMs designed to assess 30-day or in-hospital mortality after TAVR. Discrimination and calibration were assessed using C-statistics and observed/expected ratios (OERs), respectively. C-statistics were pooled using a random-effects inverse-variance method, while OERs were pooled using the Peto odds ratio. A good RSM was defined as one with a C-statistic >0.7 and an OER close to 1.0. Twenty-four studies (n=68,215 patients) testing 11 different RSMs were identified. Discrimination of all RSMs was poor (C-statistic <0.7); however, certain TAVR-specific RSMs such as the in-hospital STS/ACC TVT (C-statistic=0.65) and STT (C-statistic=0.66) predicted individual mortality more reliably than surgical models (C-statistic range=0.59-0.61). A good calibration was demonstrated by the in-hospital STS/ACC TVT (OER=0.99), 30-day STS/ACC TVT (OER=1.08) and STS (OER=1.01) models. Baseline dialysis (odds ratio [OR]: 2.64 [1.88, 3.70]; p<0.001) was the strongest predictor of mortality.

Conclusions: This study demonstrates that the STS/ACC TVT model (in-hospital and 30-day) and the STS model have accurate calibration, making them useful for comparison of centre-level risk-adjusted mortality. In contrast, the discriminative ability of currently available models is limited.

Introduction

The European Society of Cardiology/European Association for Cardiothoracic Surgery (ESC/EACTS) guidelines recommend transcatheter aortic valve replacement (TAVR) instead of surgical aortic valve replacement (SAVR) to improve survival and/or symptoms in patients with aortic stenosis who are at intermediate to high surgical risk [1]. Recent evidence suggests that the recommendation for TAVR might be extended to low surgical risk patients as well [2]. Although the use of TAVR is increasing, selecting candidates in whom the expected benefits of the intervention outweigh the risks remains a challenge. Accurate risk stratification models (RSMs) can aid this process by estimating the probability of a futile procedure, thereby helping to avoid such procedures and simplifying treatment decisions. Initially, surgical RSMs such as the Society of Thoracic Surgeons (STS) score and the European System for Cardiac Operative Risk Evaluation (EuroSCORE) were used for this purpose [3]. However, their prognostic value has been questioned, and concerns have been raised that they tend to overestimate mortality risk.

Consequently, multiple RSMs have been developed from TAVR populations; however, their reliability is not well established, and it remains unclear which of these RSMs is optimal for clinical use [4-10]. Furthermore, the external generalisability of these models is limited given the heterogeneous patient populations and the procedural and operator-specific factors involved. Therefore, pooling data from different validation studies can provide a more accurate assessment of the performance of an RSM than individual studies can. The purpose of this study was to analyse systematically the clinical practicability, productiveness and discriminative performance of each RSM by conducting a meta-analysis using data from all studies validating the particular RSM. Furthermore, we aimed to assess whether TAVR-dedicated risk scores are superior to surgical risk scores in predicting survival. In addition, we sought to review the predictors used by each RSM and evaluate which patient-specific parameters were the best predictors of post-TAVR mortality.

Methods

This systematic review and meta-analysis was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [11].

Details on search strategy (Supplementary Table 1), study selection, data extraction and quality assessment are provided in Supplementary Appendix 1 [5,10].

EFFECT SIZE ESTIMATION

Discrimination and calibration are relative and absolute measures, respectively, that are essential to have in a useful and reliable RSM. Discrimination is defined as the ability of an RSM to yield a higher “risk” for individuals who experience an event in the future, when compared with patients who do not experience the event. To evaluate discrimination, we used the C-statistic (also known as “area under the curve” [AUC]). The C-statistic ranges from 1.0 (perfect concordance between model-based risk estimates and observed events) to 0.5 (random concordance). C-statistic values have been categorised as follows: (a) 0.81-0.90 = good; (b) 0.71-0.80 = fair; (c) 0.61-0.70 = poor; and (d) 0.50-0.60 = very poor/almost no association [12]. For this meta-analysis, C-statistics and their corresponding 95% confidence intervals (CIs) were extracted from each validation study. The 95% CIs were used to compute standard errors (SEs).
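The conversion from a reported 95% CI to a standard error can be sketched as follows. This is a minimal illustration, not the authors' actual computation: it assumes a normal-approximation interval (z of about 1.96), and the helper names (se_from_ci, se_logit_from_ci, grade_c_statistic) are hypothetical. The logit-scale variant assumes the SE is taken from the logit-transformed CI limits, which the text does not specify.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile (assumed; the text does not state it)


def se_from_ci(lower, upper, z=Z_95):
    """Standard error implied by a symmetric 95% CI: half-width divided by z."""
    return (upper - lower) / (2.0 * z)


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def se_logit_from_ci(lower, upper, z=Z_95):
    """SE on the logit scale, taken from the logit-transformed CI limits (an assumption)."""
    return (logit(upper) - logit(lower)) / (2.0 * z)


def grade_c_statistic(c):
    """Category labels as listed in the text; values above 0.90 are not graded there."""
    c = round(c, 2)
    if c > 0.90:
        return "not graded in the text"
    if c > 0.80:
        return "good"
    if c > 0.70:
        return "fair"
    if c > 0.60:
        return "poor"
    return "very poor/almost no association"


# Pooled in-hospital STS/ACC TVT estimate reported in this article: 0.65 (0.62-0.68)
se_raw = se_from_ci(0.62, 0.68)
se_logit = se_logit_from_ci(0.62, 0.68)
print(round(se_raw, 4), round(se_logit, 4), grade_c_statistic(0.65))
```

Working on the logit scale keeps back-transformed confidence limits inside the 0 to 1 range, which a raw-scale interval does not guarantee.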

Calibration is the measure of how accurately the model’s predictions match overall observed events in a cohort of patients (observed/expected ratio [OER]). OERs of ~1 suggest good calibration. OERs >1 suggest underprediction, while ratios <1 suggest overprediction. From each study, we extracted the expected mortality (as predicted by the risk model) and the observed (actual) mortality. These values were then used to compute the observed – expected (O-E) value and the variance, using an online calculator (http://www.hutchon.net/peto%20vers%202.html).
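As a rough illustration of the OER, the sketch below computes the ratio for a single hypothetical cohort and attaches an approximate 95% CI. The interval treats the observed death count as Poisson, so that SE(log OER) is roughly 1/sqrt(observed); this is a common approximation assumed here for illustration and is not the Peto-based O-E and variance calculation the authors describe. All function names and cohort numbers are hypothetical.

```python
import math


def expected_deaths(predicted_risks):
    """Expected deaths in a cohort: the sum of model-predicted probabilities."""
    return sum(predicted_risks)


def oer_with_ci(observed, expected, z=1.959964):
    """OER with an approximate 95% CI (Poisson assumption on the observed count)."""
    ratio = observed / expected
    se_log = 1.0 / math.sqrt(observed)  # assumed approximation, not the Peto method
    return ratio, ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)


def interpret_oer(lower, upper):
    """Direction of miscalibration as defined in the text, judged by the CI."""
    if lower > 1.0:
        return "underprediction"
    if upper < 1.0:
        return "overprediction"
    return "CI includes 1"


# Hypothetical cohort: 1,000 patients with a predicted risk of 4% each, 30 observed deaths
expected = expected_deaths([0.04] * 1000)
ratio, lower, upper = oer_with_ci(30, expected)
print(round(ratio, 2), round(lower, 2), round(upper, 2), interpret_oer(lower, upper))
```

In this made-up cohort the point estimate is below 1 (the model predicts more deaths than occurred), but the interval still spans 1, so the overprediction would not be called significant.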

STATISTICAL ANALYSIS

We performed a meta-analysis on the C-statistics and corresponding SEs using an inverse variance random-effects model to determine the pooled discrimination. Before pooling, logit transformation of the C-statistic values was carried out. The OER and variance were calculated using the Peto odds ratio method. The OERs from each study validating a particular model were pooled to estimate the calibration of that model. Log transformation of the OER values was done prior to pooling. We also sought to assess the association of specific predictors with short-term mortality. A covariate was selected for meta-analysis if data (odds ratios [OR] and 95% CIs) on it were provided by at least two studies. Q statistics and Higgins I² were used to evaluate heterogeneity across studies; I² values of 25%-50% were considered mild, 50%-75% moderate, and >75% severe. A p-value of <0.05 was considered significant for all analyses. Review Manager, Version 5.5 (Cochrane Collaboration, Oxford, UK) was used to perform the statistical analyses.

Results

SEARCH RESULTS

The initial search produced 6,099 articles; 2,930 were reviewed at title and abstract level and an additional 2,906 articles were removed based on predetermined selection criteria. Ultimately, we included 23 full articles and one abstract (Sirotina et al. Utility of conventional surgical risk scores in predicting outcome after transcatheter aortic valve replacement, presented at American College of Cardiology (ACC) 2013 Scientific Sessions, 9 March 2013, San Francisco, CA, USA). A total of 68,215 patients from these studies were included in the analysis (Figure 1) [4-10,13-29]. These 24 studies tested 11 different RSMs (7 TAVR-specific, 3 surgical, and 1 designed for use in both TAVR and SAVR patients). Supplementary Table 2 provides a list of all included studies along with relevant study characteristics. Supplementary Table 3 displays the predictors that make up each included RSM. Assessment of risk of bias using the PROBAST scale revealed that all the new TAVR-specific models were developed using robust methods (Supplementary Table 4). Similarly, all of these models were found to have good applicability except for the UK TAVI CPM, which was adjudicated to have low applicability as it was derived from a small, selected population.

Figure 1. PRISMA flow chart outlining the literature search.

The summarised forest plots display the pooled discrimination (Figure 2) and calibration (Figure 3) of each RSM. The detailed forest plots are provided in Supplementary Figure 1-Supplementary Figure 4.

Figure 2. Summarised forest plot displaying results of meta-analysis of discrimination of each risk stratification model. AUC: area under the curve; FRANCE 2: FRench Aortic National CoreValve and Edwards; German AV Score: German aortic valve score; OBSERVANT: Observational Study Of Appropriateness, Efficacy And Effectiveness of AVR-TAVR Procedures For the Treatment Of Severe Symptomatic Aortic Stenosis; STS/ACC TVT: Society of Thoracic Surgeons/American College of Cardiology Transcatheter Valve Therapy; STS-PROM: Society of Thoracic Surgeons Predicted Risk of Mortality; STT: survival posT TAVI; UK TAVI CPM: UK transcatheter aortic valve implantation clinical prediction models

Figure 3. Summarised forest plot displaying results of meta-analysis of calibration of each risk stratification model. AUC: area under the curve; FRANCE 2: FRench Aortic National CoreValve and Edwards; German AV Score: German aortic valve score; OBSERVANT: Observational Study Of Appropriateness, Efficacy And Effectiveness of AVR-TAVR Procedures For the Treatment Of Severe Symptomatic Aortic Stenosis; STS/ACC TVT: Society of Thoracic Surgeons/American College of Cardiology Transcatheter Valve Therapy; STS-PROM: Society of Thoracic Surgeons Predicted Risk of Mortality

TAVR-SPECIFIC MODELS

STS/ACC TVT

Meta-analysis of 2016 and 2018 in-hospital risk models demonstrated a C-statistic of 0.65 (95% CI: 0.62-0.68; I²=0%) and an OER of 0.99 (95% CI: 0.92-1.07; I²=82%), indicating poor discrimination and good calibration, respectively. We could not estimate the discrimination of the 30-day model due to lack of data. The OER for this model was 1.08 (95% CI: 0.93-1.25). The 30-day mortality model has not yet been externally validated as of March 2019.

OBSERVANT

The model was found to have poor discrimination (C-statistic: 0.57; 95% CI: 0.54-0.60; I²=0%) and a significantly over-predictive calibration (OER: 0.75; 95% CI: 0.65-0.87).

FRANCE 2

The pooled results demonstrated poor discrimination (C-statistic: 0.61; 95% CI: 0.59-0.64; I²=13%). The calibration of the scale was found to be significantly over-predictive for 30-day mortality (OER: 0.57; 95% CI: 0.50-0.65; I²=0%).

COREVALVE

This model demonstrated a fair discriminative ability (C-statistic: 0.75; 95% CI: 0.35-1.15); however, a wide confidence interval makes this result unreliable. The OER was not reported by the single study validating this model. To the best of our knowledge, this RSM has not been externally validated.

STT (SURVIVAL POST TAVI)

The STT model demonstrated poor discriminative ability (C-statistic: 0.66; 95% CI: 0.56-0.76). The OER was not reported. Our search revealed no studies which externally validated this model and met the inclusion criteria.

UK TAVI CPM

This model demonstrated a poor discriminative ability (C-statistic: 0.66; 95% CI: 0.61-0.71). The OER was not reported in the publication in which this model was derived and validated. This model has not yet been validated in an external sample.

GERMAN AV SCORE

This model showed a very poor discrimination (C-statistic: 0.59; 95% CI: 0.56-0.62) and a significantly over-predictive calibration (OER: 0.72; 95% CI: 0.62-0.82).

SAVR-SPECIFIC MODELS

STS SCORE

This surgical risk model showed a poor discrimination (C-statistic: 0.60; 95% CI: 0.58-0.64; I²=34%); however, the calibration was good (OER: 1.01; 95% CI: 0.90-1.13; I²=70%).

LOGISTIC EUROSCORE

This showed very poor discrimination (C-statistic: 0.59; 95% CI: 0.56-0.62; I²=54%). Similarly, this model showed a significantly over-predictive calibration (OER: 0.30; 95% CI: 0.27-0.33; I²=88%).

EUROSCORE II

This model showed poor discrimination (C-statistic: 0.61; 95% CI: 0.58-0.64; I²=30%). The calibration of this model was over-predictive (OER: 0.79; 95% CI: 0.71-0.88; I²=80%).

P-INTERACTION BETWEEN SUBGROUPS

The overall p-interactions for both discrimination (p=0.03) and calibration (p<0.001) indicate significant differences between subgroups. Supplementary Table 5 and Supplementary Table 6 give p-interaction values between individual subgroup pairs in the discrimination and calibration analysis, respectively.

PREDICTORS OF SHORT-TERM MORTALITY

Baseline dialysis was the strongest predictor of short-term mortality (OR: 2.64 [1.88, 3.71]; p<0.001; I²=0%). Figure 4 displays all the predictors studied.

Figure 4. Forest plots displaying the association of each predictor with short-term mortality. Baseline dialysis (A) was the strongest predictor of short-term mortality, followed by critical preoperative state (B), non-femoral access site (C), NYHA Class IV (D), pulmonary hypertension (E), home oxygen use (F), age greater than 85 (G), and GFR (per 5-unit decrease) (H). GFR: glomerular filtration rate; NYHA: New York Heart Association; PAH: pulmonary arterial hypertension

Discussion

This meta-analysis of 68,215 patients shows that RSMs designed specifically for TAVR patients show poor discrimination (C-statistic range: 0.57-0.66); however, some of these models, such as the in-hospital STS/ACC TVT (C-statistic=0.65), STT (C-statistic=0.66), and UK TAVI CPM (C-statistic=0.66) predicted individual mortality more reliably than surgical models (C-statistic range: 0.59-0.61). Amongst the new TAVR-specific models that reported data on calibration, the STS/ACC TVT (both the in-hospital as well as the 30-day mortality versions) had the best performance. When both discrimination and calibration were considered together, the in-hospital STS/ACC TVT was the best performing RSM. Amongst the individual parameters analysed, baseline dialysis and non-femoral access site were the strongest predictors of 30-day mortality.

Globally, in the last few years, TAVR has been performed in more than 400,000 patients and indications keep growing at a rate of 40% annually [27]. This has presented the need for RSMs that can predict 30-day mortality, thereby allowing patient selection and provider comparisons [27]. Due to the lack of TAVR-specific models initially, several investigators tested the usefulness of surgical RSMs in assessing the risk of mortality in patients undergoing TAVR. However, valid concerns were raised about the limitations of surgical models. For example, these models do not include crucial factors that are strongly believed to affect candidacy for TAVR, such as home oxygen use, access site, assessments of frailty, and consideration of functional disabilities. Since 2014, several TAVR-specific models have emerged. However, reports concerning the applicability of these TAVR-specific RSMs have varied markedly in their findings.

A model with a discriminative capacity of C>0.80 provides strong support to guide medical decision making and can reliably indicate whether a patient will experience an event. Strongly discriminative models can also be relevant for research purposes, such as covariate adjustment in RCTs. Unfortunately, our study found that neither surgical nor TAVR-specific risk models currently meet the threshold of C>0.80. The highest C-statistic was that of the CoreValve model (C-statistic=0.75), but it was unreliable due to a wide 95% CI (0.35-1.15). This unreliability may be because only a single, relatively small study developed and validated this RSM, and because external validation studies are lacking. The discriminative ability of the CoreValve model will become clearer as additional studies validate it. When both the C-statistic and 95% CI are considered, the in-hospital STS/ACC TVT model currently appears to have the best discrimination (C-statistic: 0.65; 95% CI: 0.62-0.68). We were only able to perform a meta-analysis on the C-statistics from an older version of this model; an updated version demonstrated an even better C-statistic reaching up to 0.70 for in-hospital mortality and 0.71 for 30-day mortality [10]. However, there is still room for improvement. For example, other cardiovascular risk models, such as the ones for the management of heart failure and percutaneous coronary intervention, demonstrate C-statistics >0.80 for 30-day mortality [30]. There are two possible explanations as to why the TAVR-specific risk models do not currently achieve this level of discrimination. First, it could be due to limitations in the model, such as an insufficient number of predictors or predictors being dichotomised for simplicity. Additionally, relatively small and homogeneous derivation cohorts, and absence of validation in external data sets could also be responsible. If this is the case, additional data (for example, from the continuously growing TVT registry), along with periodic model refinements will probably improve the discrimination. Regular model updates using the most recent outcome data are particularly important in a rapidly evolving field such as TAVR, where device and procedural advances have been shown to reduce periprocedural complications significantly, as reflected by a large heterogeneity of reported outcomes across major studies [22]. A second reason for the weak discrimination could be the inherent inability to discriminate between patients who will or will not die post TAVR. However, a poorly discriminating model (e.g., C~0.6) may be useful (when used in conjunction with clinical judgement) in a situation that does not have one outcome or choice that is clearly better or more likely than another.

RSMs with a good calibration (OER ~1) are useful for benchmarking and comparison of centre-level risk-adjusted outcome. This can be used by providers and sites to spur quality improvement, resulting in improved outcomes in patients with different risk profiles. According to our study, both the STS/ACC TVT (in-hospital and 30-day versions) and STS models demonstrate good calibration and may be used for this purpose. Our study demonstrates that there is considerable heterogeneity in the covariates incorporated in the TAVR-specific risk prediction models. This underscores the need for combining these covariates to form an RSM that outperforms the currently available RSMs.

Limitations

This meta-analysis has limitations that need to be considered when interpreting the results. First, this meta-analysis is based only on retrospective observational studies and some bias may be present as not all parameters may have been available for calculation in the risk models. In the future, large prospective validation cohorts are needed to assess the accuracy of such RSMs and validate our results. Second, some validation studies had to be excluded from our analysis as relevant data were not provided, which could have contributed to bias. Third, these estimates are derived from individual studies as we did not have access to the individual patient data. Fourth, most of these models were derived from patient populations with high to intermediate risk. Amongst the low-risk patient population, comorbidities are a less relevant part of risk scores to predict outcomes; other factors such as anatomical and procedural variables may be more important but are traditionally not included in RSMs. The publication of studies in lower-risk populations (such as the PARTNER 3 and Evolut trials) is likely to shift the TAVR use to lower-risk patients; the applicability of these scales in a lower-risk population is currently not known. While the focus of this manuscript is short-term mortality, it must be noted that it is not the only outcome driving clinical decisions. Long-term efficacy, functional outcomes and quality of life are also important and must be considered.

Conclusions

In conclusion, our study demonstrates that the in-hospital STS/ACC TVT model, the 30-day STS/ACC TVT model, and the STS model have accurate calibration in predicting short-term mortality. This makes these models useful for comparison of centre-level risk-adjusted mortality. In contrast, the discriminative ability of currently available models is limited, and room for improvement exists before wide clinical implementation.

Impact on daily practice

This study demonstrates that the STS/ACC TVT models (in-hospital and 30-day) and the STS model have accurate calibration and can therefore help physicians and administrators to compare centre-level risk-adjusted mortality. Discrimination of all RSMs was poor, and room for improvement exists before these can be used to predict the risk of individual patient mortality reliably. This study also reviews the predictors that make up each RSM and highlights the strongest predictors of mortality, which can assist in the development of new, better-performing models.

Conflict of interest statement

D.L. Bhatt discloses the following relationships – Advisory Board: Cardax, Elsevier Practice Update Cardiology, Medscape Cardiology, PhaseBio, Regado Biosciences; Board of Directors: Boston VA Research Institute, Society of Cardiovascular Patient Care, TobeSoft; Chair: American Heart Association Quality Oversight Committee; Data Monitoring Committees: Baim Institute for Clinical Research (formerly Harvard Clinical Research Institute, for the PORTICO trial, funded by St. Jude Medical, now Abbott), Cleveland Clinic (including for the ExCEED trial, funded by Edwards), Duke Clinical Research Institute, Mayo Clinic, Mount Sinai School of Medicine (for the ENVISAGE trial, funded by Daiichi Sankyo), Population Health Research Institute; Honoraria: American College of Cardiology (Senior Associate Editor, Clinical Trials and News, http://ACC.org; Vice-Chair, ACC Accreditation Committee), Baim Institute for Clinical Research (formerly Harvard Clinical Research Institute; RE-DUAL PCI clinical trial steering committee funded by Boehringer Ingelheim), Belvoir Publications (Editor in Chief, Harvard Heart Letter), Duke Clinical Research Institute (clinical trial steering committees), HMP Global (Editor in Chief, Journal of Invasive Cardiology), Journal of the American College of Cardiology (Guest Editor; Associate Editor), Medtelligence/ReachMD (CME steering committees), Population Health Research Institute (for the COMPASS operations committee, publications committee, steering committee, and USA national co-leader, funded by Bayer), Slack Publications (Chief Medical Editor, Cardiology Today’s Intervention), Society of Cardiovascular Patient Care (Secretary/Treasurer), WebMD (CME steering committees); Other: Clinical Cardiology (Deputy Editor), NCDR-ACTION Registry Steering Committee (Chair), VA CART Research and Publications Committee (Chair); Research Funding: Abbott, Amarin, Amgen, AstraZeneca, Bayer, Boehringer Ingelheim, Bristol-Myers Squibb, Chiesi, Eisai, Ethicon, Forest Laboratories, Idorsia, Ironwood, Ischemix, Lilly, Medtronic, PhaseBio, Pfizer, Regeneron, Roche, Sanofi Aventis, Synaptic, The Medicines Company; Royalties: Elsevier (Editor, Cardiovascular Intervention: A Companion to Braunwald’s Heart Disease); Site Co-Investigator: Biotronik, Boston Scientific, St. Jude Medical (now Abbott), Svelte; Trustee: American College of Cardiology; Unfunded Research: FlowCo, Fractyl, Merck, Novo Nordisk, PLx Pharma, Takeda. The other authors have no conflicts of interest to declare.