August 2018
Comment: In head and neck surgical oncology, we tend to take the findings of our relatively limited number of randomized trials as gospel because they are the “highest level of evidence.” However, these authors from Toronto found that, in most randomized surgical trials in our literature, changing one patient from a nonevent to an event in the treatment arm would change the result to a statistically nonsignificant result. In other words, in most studies, changing the outcome of a single patient completely changes the conclusions of the study. Additionally, the finding that the fragility index (FI) was less than the number of patients lost to follow-up in 71% of cases calls into question the quality and validity of many of these trials that we base critical treatment decisions on. —Andres Bur, MD
How robust are the statistically significant findings in randomized trials in the head and neck cancer literature where surgery was a primary intervention?
Bottom line
The statistical significance of the majority of randomized controlled trials (RCTs) in the head and neck oncologic literature hinges on only a few events. The calculated fragility index (FI) score was lower than the number of patients lost to follow-up in a majority of cases. The FI helps address the deficits of the threshold P value and may serve as a useful adjunct to other metrics such as effect size and 95% confidence intervals.
Background: RCTs are the basis for evidence-based medicine and guide clinical decision making. The conclusions drawn from these trials are often based on tests of statistical significance, which are taken to suggest that the findings are robust and not spurious. Traditionally, the threshold P value of < .05 has been used to dictate whether an intervention reached statistical significance; however, the P value is frequently criticized for being overly simplistic. Many readers place the same degree of confidence in similar P values, irrespective of additional factors such as sample size or number of outcome events.
The FI was developed to communicate the limitations of the P value. The FI score is defined as the minimum number of patients whose status would have to change from a nonevent to an event for statistical significance to be lost. It is calculated by iteratively adding events to the trial arm with the fewest events (converting one nonevent to an event at each step) until the recalculated P value is ≥ .05.
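The iterative calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the study authors' code: it assumes a two-sided Fisher exact test for the recalculated P value (the summary does not name the test), and the function names `fisher_exact_two_sided` and `fragility_index` are hypothetical. The trial counts in the usage line are invented for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Rows are trial arms; columns are (events, nonevents).
    Assumption: a Fisher exact test is used; the summary does not say which test.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in arm 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of nonevent-to-event changes needed to reach P >= alpha.

    Returns None if the original result is not statistically significant.
    """
    p = fisher_exact_two_sided(events_a, n_a - events_a, events_b, n_b - events_b)
    if p >= alpha:
        return None

    # Modify the arm with the fewest events, as in the definition above.
    if events_a <= events_b:
        few, n_few, other, n_other = events_a, n_a, events_b, n_b
    else:
        few, n_few, other, n_other = events_b, n_b, events_a, n_a

    fi = 0
    while few < n_few:
        few += 1  # convert one nonevent into an event
        fi += 1
        p = fisher_exact_two_sided(few, n_few - few, other, n_other - other)
        if p >= alpha:
            return fi
    return None  # significance never lost (not expected in practice)


# Hypothetical trial: 1/20 events in one arm versus 9/20 in the other.
print(fragility_index(1, 20, 9, 20))  # → 2
```

In this invented example the original P value is about .008, yet changing the outcome of only two patients is enough to push it to ≥ .05, which is the kind of fragility the FI is meant to expose.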
The developers of the FI have demonstrated that 24% of RCTs published in high-impact journals hinge on three or fewer events, and that more than 50% of trials had an FI score that was lower than the number of patients lost to follow-up. The FI tool has yet to be explored in the head and neck surgical patient population.
Study design: Potential articles were identified in PubMed, Embase, and Cochrane without publication date restrictions.
Synopsis: Two reviewers independently screened eligible RCTs reporting at least one dichotomous and statistically significant outcome. The data from each trial were extracted and the FI scores were calculated. Associations between trial characteristics and FI were determined.
In total, 27 articles were identified. The median sample size was 67.5 (interquartile range [IQR] = 42–143) and the median number of events per trial was eight (IQR = 2.25–18.25). The median FI score was one (IQR = 0–2.5), meaning that changing one patient from a nonevent to an event in the treatment arm would render the result statistically nonsignificant (P ≥ .05). The FI score was less than the number of patients lost to follow-up in 71% of cases. The FI score was found to be moderately correlated with P value (ρ = −0.52, P = .007) and with journal impact factor (ρ = 0.49, P = .009) on univariable analysis. On multivariable analysis, only the P value was found to be a predictor of FI score (P = .001).
Citation: Noel CW, McMullen C, Yao C, Monteiro E, et al. The fragility of statistically significant findings from randomized trials in head and neck surgery [published online ahead of print April 23, 2018]. Laryngoscope. doi: 10.1002/lary.27183.