The Statistical Fragility of Management Options for Acute Achilles Tendon Ruptures – A Systematic Review of Randomized Control Trial with Fragility Analysis

Importance: In the treatment of acute Achilles tendon rupture, there is no uniform consensus on which of the many treatment modalities for this common injury is superior with respect to all possible complications. This review assesses the statistical quality of the available evidence. Objectives: The P value is the common method to determine the significance of a finding, but it does not convey statistical robustness. The reversal of a small number of outcome events can be enough to change a finding of significance; this is known as statistical fragility, which can be measured with the fragility index (FI) and fragility quotient (FQ). The purpose of this study was to examine the statistical fragility of randomised control trials (RCTs) reporting outcomes of acute Achilles tendon rupture (AATR) management. Evidence review: A systematic search strategy was used to find RCTs published since 1990 investigating AATR management. The FI was calculated using Fisher's exact test by sequentially altering the number of events until there was a reversal of significance. The FQ was calculated by dividing the FI by the sample size. Each trial was assigned an overall FI and FQ calculated as the median result of its reported findings. Findings: Overall, 55 RCTs met the inclusion criteria, including 4,205 patients, 82.7% of whom were male; there was a mean age of 41 years and a mean follow-up of 21 months; 60% of RCTs either did not report a statistical power analysis or were statistically underpowered. The overall FI was 4, indicating the reversal of just four outcomes would change the significance finding. The overall FQ was 0.082, suggesting that reversing eight patients out of every 100 would alter significance. In 22/55 (40%) RCTs, the number of patients lost to follow-up was greater than or equal to the FI of the trial.
Conclusion: This review indicates the RCT literature for AATR management may be vulnerable to statistical fragility, with a handful of events required to reverse a ﬁ nding of signi ﬁ cance. We recommend that future trials in this area report the FI, FQ, and P value, to aid readers in assessing the evidence, therefore impacting clinical decision making. Level of evidence: I; Systematic Review of Randomised Control Trials.


Introduction
Acute Achilles tendon rupture (AATR) is a common injury, with approximately 2.5 ruptures per 100,000 patient-years, and its incidence is trending upwards [1][2][3][4][5]. Young, physically active populations, particularly males, suffer the majority of AATRs, with another peak in an older non-active cohort [3,6]. Management of AATR may be surgical or conservative, with a variety of accepted rehabilitation schemes [7]. However, there remains no uniform consensus on the optimal approach to the management of AATR despite it being a prolific area of research [8][9][10][11][12].
Objective data are required for the practice of evidence-based medicine [13]. With respect to this, level I evidence from randomised control trials (RCTs) is at the top of the hierarchy of evidence used to inform clinical practice [13]. RCTs are difficult to conduct in orthopaedics due to small sample sizes, difficulty blinding participants and clinicians, patients rejecting randomisation due to a treatment preference and the challenge of follow-up over long time periods [14,15]. The significance threshold is set in an arbitrary fashion at α = 0.05, and the P value is the nearly universal tool for determining the statistical significance of trial findings [16].
However, due to the often small sample sizes utilised in RCTs that report statistically significant findings, changing a relatively small number of event outcomes may reverse the finding of statistical significance [17]. The fragility index (FI) is a concept first described by Feinstein; it is the minimum number of events that must be reversed to change the significance of an RCT's findings using Fisher's exact test [18]. The FI, unlike the P value, has no arbitrary point at which it is deemed significant. The FI is an absolute number that does not account for the sample size from which it is calculated. The fragility quotient (FQ), described by Ahmed, is calculated by dividing the FI by the study's population size [19], thereby expressing the fragility of the findings relative to the sample size of the trial, which gives context to the degree of fragility and is useful for comparison.
To the best of the authors' knowledge, the use of FI and FQ statistical analysis has not been performed exclusively on RCT level one evidence examining the management of AATR before. The purpose of this study was to examine the statistical fragility of RCTs reporting outcomes of AATR management. Our hypothesis was that included studies would be consistently fragile to a reversal of their stated findings and that the FI would be comparable to the number lost to follow-up (LTFU).

Search strategy
In accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, two independent reviewers performed a systematic review of the literature in August 2021, searching three databases (PubMed, Scopus, and Embase) [20]. The search terms utilised were (Achilles tendon OR tendoachilles OR tendo Achilles OR tendoachillis OR tendo Achillis OR calcaneal tendon OR tendocalcaneus OR tendo calcaneus) AND (rupture OR ruptures OR tear OR tears OR lesion OR lesions OR injury OR injuries) AND (treatment OR intervention OR management OR repair). The texts discovered using this search strategy were screened by both independent reviewers, with the removal of duplicate studies, followed by the application of our eligibility criteria.

Eligibility criteria
The inclusion criteria were (1) RCTs investigating the management of AATRs, (2) reporting of dichotomous outcomes and statistical significance, (3) full-text studies published in the last 30 years, (4) publication in a peer-reviewed journal, and (5) publication in the English language. The exclusion criteria were (1) RCTs without a clear randomisation protocol, (2) review articles, (3) in vitro studies, and (4) studies involving animals. Discrepancies between the two independent reviewers regarding whether a study met the inclusion or exclusion criteria were resolved by the senior author.

Assessment of evidence
All included studies were assessed for their reported level of evidence, using the Journal of ISAKOS criteria [21]. The modified Jadad scale with a maximum score of 8 was used to assess the quality of evidence (QOE) of the RCTs, with a score of 4 or more being considered a high QOE and a score of 3 or less as a low QOE [22]. All studies were assessed for the presence and nature of a statistical power analysis [23]. The 2021 impact factor of the publication journal was recorded.

Data extraction
Following the application of our pre-determined inclusion and exclusion criteria, the following variables were collected by both reviewers from included studies and entered into a password-protected database in Microsoft Excel: (1) year of publication, (2) randomisation methods, (3) statistical power analysis (type of analysis and reported power), (4) the primary and secondary outcomes as specified in the trial protocol, (5) length of follow-up (months), (6) the number of participants included in each of the treatment arms, (7) mean age of participants (years), (8) gender of participants, (9) number as protocol (AP), number per protocol (PP), and number LTFU, (10) the reported significance of each event, and (11) all dichotomous outcomes of interest. These outcomes included re-rupture and partial re-rupture rates, excessive tendon lengthening, sural nerve injury, all wound infections and complications, return to sport and return at preinjury level, as well as deep vein thrombosis. As protocol describes the number of patients in a trial who were randomised to a study arm and received that treatment. Per protocol is defined as the number of patients who remain at the end of the follow-up period; it is the difference between AP and LTFU [24].

Statistical analysis
The FI was calculated using the free online GraphPad software [25]. For every dichotomous finding reported, both the events and non-events for each of the treatment arms were entered into the 2 × 2 grid, and a two-tailed Fisher's exact test was used to calculate the P value, with α = 0.05. This step is imperative as some reported P values were calculated using the Chi-squared test. In calculating the FI, the 2 × 2 grid is manipulated until there is a reversal of the significant finding.
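For readers unfamiliar with the test, the following is a sketch of the standard hypergeometric formulation of Fisher's exact test for a 2 × 2 grid. The symbols a, b, c, d and n are introduced here for illustration only, and the two-tailed rule shown (summing all tables no more probable than the observed one) is one common convention; it is an assumption that the online calculator applies the same rule.

```latex
% Arm 1: a events, b non-events. Arm 2: c events, d non-events. n = a + b + c + d.
% Probability of one particular table, with row and column totals held fixed:
P_{\text{table}} = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}}

% Two-tailed P value: sum over every table T with the same margins
% that is no more probable than the observed table.
P_{\text{two-tailed}} = \sum_{T \,:\, P_T \le P_{\text{observed}}} P_T
```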
For any given outcome reported as significant, the number of events required to make P > 0.05 was calculated by adding +1 to the events of the treatment arm that had fewer events and subtracting 1 from its non-events to maintain the population of that treatment arm. This process was repeated until the result became non-significant. Conversely, for findings that were not significant, the number of events required to decrease P to <0.05 was calculated by adding +1 to the events of the treatment arm that had more events and subtracting 1 from its non-events to maintain the population of that treatment arm, as shown in Fig. 1. This process was repeated until the result became significant. The number of events changed was recorded as the FI for that finding. Each study was given an overall FI, calculated as the median FI of the findings in that particular study; the interquartile range (IQR) was also calculated, illustrating the central distribution of the FI. For each finding, the FQ was calculated in Microsoft Excel by dividing the FI by the per protocol number for that study. The overall median FQ and IQR for each study were calculated in the same manner as the FI. We used Pearson's correlation coefficient when assessing for direct correlation.
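The iterative procedure described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the software used in this review: the function names are hypothetical, the two-tailed P value sums all tables no more probable than the observed one (which may not match the online calculator exactly), and the arm with "fewer" or "more" events is chosen by event proportion, an assumption that coincides with event counts when the arms are of equal size.

```python
from math import comb
from statistics import median


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact P for the 2 x 2 grid [[a, b], [c, d]].

    Rows are treatment arms; columns are events and non-events.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in arm 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p = sum(q for q in map(prob, range(lowest, highest + 1))
            if q <= observed * (1 + 1e-7))
    return min(1.0, p)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Event reversals needed to flip significance; None if it never flips."""
    events, sizes = [events_1, events_2], [n_1, n_2]

    def p_value():
        return fisher_exact_two_tailed(
            events[0], sizes[0] - events[0], events[1], sizes[1] - events[1])

    significant = p_value() < alpha
    # Assumption: arms are ranked by event proportion, fixed at the start.
    low, high = (0, 1) if events_1 / n_1 <= events_2 / n_2 else (1, 0)
    arm = low if significant else high  # significant: add to the lower arm
    flips = 0
    while (p_value() < alpha) == significant:
        if events[arm] == sizes[arm]:
            return None  # no non-events left to convert in this arm
        events[arm] += 1  # +1 event, so implicitly -1 non-event
        flips += 1
    return flips


def fragility_quotient(fi, per_protocol_n):
    """FQ: the FI divided by the per protocol number of the study."""
    return fi / per_protocol_n


# Hypothetical finding: 1/20 events in one arm vs 9/20 in the other.
fi = fragility_index(1, 20, 9, 20)
fq = fragility_quotient(fi, 40)
# Hypothetical trial reporting three findings: the trial FI is their median.
trial_fi = median([fi, fragility_index(3, 20, 9, 20), 4])
print(fi, fq, trial_fi)
```

Fixing the arm at the start and returning None when it runs out of non-events are design choices made here to guarantee termination; the source procedure does not specify how such edge cases were handled.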

Literature search
Following our initial search, a total of 12,359 studies were returned. Following manual removal of duplicate studies, 6,828 studies remained for application of our eligibility criteria. Thereafter, the titles and abstracts were evaluated, yielding 223 studies for full-text review, as illustrated in Fig. 2.
Overall, there were 51 studies meeting the eligibility criteria, therefore warranting inclusion in this systematic review. Of these publications, Costa 2006 [26] was, in fact, two separate RCTs; Eliasson 2018 [27] and Manent 2019 [28] were three-armed studies, which had to be subdivided to compare one treatment arm against another and make their trichotomous results dichotomous. This resulted in a total of 55 trials to be assessed. The mean Jadad score was 5.7 ± 1.1, with a range of 3-7.5, and 49/51 trials were scored as high QOE. The current impact factor of the journals in which the included studies are published had a mean of 6.467 ± 11.2 and a range of 1.286-79.321. The eligible RCTs included 4,205 patients, of whom 82.7% were male, and there was a mean age of 40.3 ± 4.1 years and a mean follow-up of 21 ± 28.1 months. The included studies are displayed in Appendix 1.

Surgical comparisons
There were 13 RCTs comparing surgical methods. Open repair (OR) vs percutaneous repair (PR) were compared in six RCTs with a total of 345 participants, who had a mean age of 40 ± 1.7 years, were followed up for 15.5 ± 7.4 months, and 80.2% were male. They had a median FI of 4.25 and FQ of 0.086 [39][40][41][42][43][44]. OR vs minimally invasive surgery (MIS) were compared in three RCTs with a total of 143 participants with a mean age of 47.6 ± 7.1 years, who were followed up on average for 23.4 ± 0.8 months and were 88.7% male. They had a median FI of 4 and an FQ of 0.085 [45][46][47]. OR vs augmented open repair (AOR) were compared in four RCTs, with a total of 174 participants who had a mean age of 38.2 ± 1.8 years, were followed up on average 57.6 ± 64.2 months and were 83.9% male. They had a median FI of 4.25 and FQ of 0.108 [48][49][50][51].

Platelet-rich plasma vs Conventional treatment
There were four trials investigating platelet-rich plasma (PRP). Two RCTs compared OR vs OR and PRP with 66 participants, a mean age of 38.2 ± 2.7 years, an average follow-up of 18 ± 6 months, and 88.7% were male. They had a median FI of 4 and FQ of 0.130 [52,53]. The other two RCTs compared conservative treatment (CT) and PRP vs CT and placebo, with a total of 269 participants who had a mean age of 43 ± 2.5 years, were followed for an average of 9 ± 3 months, and 87.6% were male. They had a median FI of 5.75 and FQ of 0.093 [54,55].

Intermittent pneumatic compression vs conventional treatment
OR vs OR and intermittent pneumatic compression (IPC) were compared in three RCTs, with a total of 332 participants who had a mean age of 40.2 ± 0.2 years, were followed up for an average of 5 ± 4.9 months, and 81.8% were male. They had a median FI of 5 and an FQ of 0.043 [56][57][58].

Early vs later rehabilitation
Twenty-three RCTs compared early vs later rehabilitation. OR with early vs later rehab were compared in 14 RCTs with a total of 872 participants, who had a mean age of 38.5 ± 2.1 years, were followed up on average for 25.6 ± 34.6 months, and 81.8% were male. They had a median FI of 4.5 and an FQ of 0.087 [26,27,[59][60][61][62][63][64][65][66][67][68][69]]. CT with early vs later rehab was compared in seven RCTs, with a total of 923 participants who had a mean age of 43.7 ± 5.2 years, were followed up on average for 17.6 ± 14.9 months, and 78.7% were male. They had a median FI of 4.5 and an FQ of 0.085 [26,[70][71][72][73][74][75]].
There was one RCT for MIS early vs later rehab, which had 38 participants who were all male, had a mean age of 41.6 ± 9.5 years and were followed up for 3 months. The trial FI was 4.5 and FQ 0.082 [76]. There was one RCT comparing PR early vs later rehab, which had 60 participants, of whom 76.7% were male, with a median age of 43 (range 19-65), and they were followed up for an average of 12 months. The trial FI was 5.5 and FQ was 0.172 [77].

Fragility index and quotient
A total of 246 dichotomous events were recorded from the 55 included trials. The overall median FI was 4 [IQR, 4-5], and the median FQ was 0.082 [0.056-0.115]. The median number of patients LTFU was 3 [0-7]. When assessing each event, we recorded whether it was reported as a significant finding [P < 0.05] or non-significant [P > 0.05], if it was a primary or secondary study finding, and also the number of patients lost to follow-up (LTFU) in that trial. We performed a subgroup analysis, as shown in Table 1 suggesting that it was more robust. Outcomes that were reported as significant were more fragile, with a lower median FI of 3 [1.5-4], than those reported as non-significant, with an FI of 4 [4-5]. There was also a substantial difference between the FQ of the two groups, with significant results having a much lower median FQ of 0.036 than non-significant results, with a median FQ of 0.083. While primary and secondary outcomes had surprisingly similar fragility, primary outcomes had a median FI of 4 [4-5].
The subgroup analysis of outcomes included tendon re-rupture, excessive tendon lengthening, sural nerve injuries, deep infections, return to sport and return at the pre-injury level. The FI and FQ of these subgroups were devoid of major outliers, and the fragility of the outcomes was consistent within the group and when compared to the overall median FI and FQ, as shown in Table 1.

Power analysis
Nineteen publications failed to provide a power analysis, although Zou et al. casually stated their study was underpowered due to a small sample size, and Schepull et al. stated they could not perform an a priori analysis due to the novel nature of the study. Six publications reported an a posteriori power analysis, of which three met a minimum power of 80% [44,50,51], and three were underpowered [32,56,61]. Twenty-five publications reported an a priori power analysis, of which one did not report the findings [78], six were underpowered, meaning they did not meet the sample size required by their a priori analysis, and 18 had a sufficient sample size to meet a minimum a priori requirement of 80% power (Fig. 3). We observed an association between higher powered studies and those with higher FIs. Data are shown fully in Table 2, with IQR being informative to the distribution of the data (see Table 3).
We did not observe a strong relationship between the median FI and the AP or PP population of a trial. The Pearson's correlation coefficient between AP trial population and median trial FI was R(53) = 0.52, P < 0.001, and between PP trial population and median FI was R(53) = 0.51, P < 0.001. This suggests a moderate positive correlation between larger study size and being less fragile, that is, having a greater FI (Fig. 4). The Pearson's correlation coefficient between total AP trial participants in a trial and the number LTFU was R(53) = 0.44, P < 0.001.
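As a worked illustration of how such coefficients are obtained, the sketch below computes Pearson's correlation coefficient from paired values and converts a coefficient into its t statistic with n − 2 degrees of freedom, which is where the 53 in the notation above comes from for 55 trials. The function names and the small data set are hypothetical and are not the trial data from this review.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two sequences of equal length, at least 3 long")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic and degrees of freedom for testing r against zero."""
    df = n - 2
    return r * sqrt(df / (1 - r * r)), df


# Hypothetical trial sizes and median FIs, for illustration only.
sizes = [30, 40, 60, 80, 120, 229]
median_fis = [3.0, 4.0, 4.0, 4.5, 5.0, 5.5]
print(round(pearson_r(sizes, median_fis), 3))

# Reported coefficient for AP population vs median FI across 55 trials.
t, df = t_from_r(0.52, 55)
print(round(t, 2), df)
```

The P value then follows from the t distribution with the stated degrees of freedom; that step is left out here to keep the sketch free of external dependencies.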

Discussion
This systematic review and fragility analysis of RCTs studying AATR showed that even high-quality level one evidence in this field is vulnerable to fragility, with a handful of outcome reversals enough to change a significant finding. Significant findings that inform clinical practice in the management of AATR, including tendon re-rupture, excessive tendon lengthening, infections, and RTP, consistently display moderate levels of fragility.
Of the 55 trials included in our review, 40% had an FI ≤ the number LTFU, and of the 246 included events, 38% had an FI ≤ the number LTFU. These figures are of some cause for concern, and it is possible that in approximately two-fifths of the RCTs on AATR, the significance finding may have been inverted had the trial population been more fully followed up. It should also be noted that events that had an FI ≤ LTFU had a lower FQ than those with an FI > LTFU (FQ = 0.071 and 0.098, respectively), suggesting that, as a group, they were more fragile. Of the 55 trials, only 41% reported that they had appropriate statistical power, and this group had the most robust median and IQR FI. The 59% of studies that failed to report power or were statistically underpowered are at risk of type II errors, that is, a failure to detect a statistical difference when one does exist.
Statistically significant moderate to weak correlations between FI and sample size have been shown before, as was the case in this review [17,81,82]. It is intuitive that studies with more participants will generally be more statistically robust. However, readers should not automatically assume a larger trial size prevents fragility. We found the journal impact factor to have a moderately positive correlation with median FI. This correlation was not strong, and the reputation of the journal should not be relied upon to ensure robustness. It was of interest that the correlation between the number LTFU and median FI was weakly positive. A large, appropriately powered study with good methodology may have a large number of LTFU simply due to the scale of the trial and still report statistically robust results, as was the case for Keene et al., who recruited 229 patients, had 28 LTFU and an FI of 5.5 [54].
To put the figures we have reported into context, we offer a number of comparisons. A 2018 fragility analysis of RCTs with "strong evidence," which contributed to the American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines, found the median FI to be 2 and the median FQ to be 0.022, with 53% of RCTs statistically underpowered [83]. In a series of twelve other recently published surgical fragility analyses, we calculated the median FI to be 3 and the median FQ to be 0.039 [81,82,[84][85][86][87][88][89][90][91][92][93]]. Three of these fragility analyses reported the mean impact factor of the journals their constituent RCTs were published in [3.2, 2.4 and 5.4] [82,84,86], which were all less than the mean of 6.467 in this review. The FI and FQ of this review of AATR management may be described as slightly more robust than the existing literature. These findings, however, suggest that although more robust than other areas of orthopaedic literature, studies reporting outcomes of AATR remain fragile, with few events required to result in a reversal of statistical significance.
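The comparison between the FI and the number LTFU amounts to a simple per-trial check, sketched below. The class and function names are hypothetical. The first entry uses the figures reported for Keene et al. [54], on the assumption that the 229 recruited patients correspond to the as protocol population; the other two entries are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    label: str
    as_protocol: int   # AP: randomised and received the allocated treatment
    ltfu: int          # number lost to follow-up
    median_fi: float   # trial-level median fragility index

    @property
    def per_protocol(self) -> int:
        # PP is the difference between AP and LTFU.
        return self.as_protocol - self.ltfu

    @property
    def fi_within_ltfu(self) -> bool:
        # True when the patients lost could, in principle, have
        # supplied enough events to reverse the significance finding.
        return self.median_fi <= self.ltfu


def share_fi_within_ltfu(trials):
    """Proportion of trials whose FI does not exceed the number LTFU."""
    return sum(t.fi_within_ltfu for t in trials) / len(trials)


trials = [
    Trial("Keene et al. [54]", as_protocol=229, ltfu=28, median_fi=5.5),
    Trial("hypothetical trial A", as_protocol=60, ltfu=2, median_fi=4.0),
    Trial("hypothetical trial B", as_protocol=40, ltfu=0, median_fi=4.5),
]

for t in trials:
    print(t.label, t.per_protocol, t.fi_within_ltfu)
print(share_fi_within_ltfu(trials))
```

A flag of True does not mean the finding would have reversed, only that the missing patients are numerous enough for a reversal to be arithmetically possible.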
In their recent meta-analysis, Seow et al. report the most comprehensive investigation of AATR to date [12]. They found that conservative management of AATR results in 4-times higher re-rupture rates than operative repair but was favourable with respect to other complications, especially infections. They found that MIS was significantly favoured over OR for outcomes other than re-rupture, for which there was no difference. A non-significant finding of early rehab being preferred to later rehab was also reported [12]. The available evidence for the management of AATR suggests a clinical decision based on the individual patient factors is most appropriate. Although the findings of this review show no evidence to dispute their conclusion that there is no preferential treatment with respect to all possible complications, the RCTs reporting such outcomes remain relatively fragile, and further studies in this area will be required.
This review lends further support to calls for so-called "triple reporting" of P value, FI and FQ, along with robust information on those LTFU [84,91]. This will allow readers to better determine the statistical legitimacy of a study's findings. The American Statistical Association (ASA) statement in 2016 reaffirmed that P values do not measure the probability of a true result, "the size of an effect or the importance of a result" [94]. The reporting of FI and FQ is not a statistical silver bullet, but it does help interpret a P value in an era in which the ASA says decisions should not be based solely on arbitrary significance thresholds [94].
Fig. 3. Median fragility index of trials grouped by power analysis. Boxplot of statistical power subgroup analysis: blue, power analysis not reported; orange, appropriate statistical power; grey, statistically underpowered. FI, fragility index.

Limitations
Our study is not without potential limitations. One potential limitation of the analysis is the exclusive review of RCTs, which excluded many informative comparative studies. However, the authors believe this to be appropriate as the use of FI should be confined to RCTs alone to avoid the potential risk of selection bias and confounding factors, which may be present in non-randomised studies [85]. It is also justifiable as the studies selected were at the top of the hierarchy of evidence [21]. The exclusion of texts in other languages was unlikely to have changed our findings [95][96][97]. Another potential limitation is that a majority of the dichotomous events analysed were secondary findings. Studies are powered for the detection of the primary outcome and so might be underpowered for secondary outcomes. Many of the included secondary outcomes were important clinical outcomes, which are used to inform practice, and so warranted review. Furthermore, many of the studies report scales such as the Achilles tendon Total Rupture Score, which are continuous variables; as there is no cut-off score indicating a positive or negative result, such a variable cannot be dichotomised or included in an FI analysis [98]. Another obstacle we encountered was the reporting of scales such as the Leppilahti Score, which has continuous data in the form of scores from 0 to 100, and categorical data in the form of four outcome categories [99]. This scale can easily be dichotomised into excellent and good vs fair and poor. However, this is impossible when only the mean score is reported, as was the case for the majority of included studies. We would encourage the publication of the number of patients in each category, in addition to the overall mean score, as it is both more informative and allows the calculation of an FI.

Conclusion
This review indicates the RCT literature for AATR management may be vulnerable to statistical fragility, with a handful of events required to reverse a finding of significance. We recommend that future trials in this area report the FI and FQ in addition to the P value, to aid readers in assessing the evidence, therefore impacting clinical decision making.

Funding
No funding was sought or received.

Ethical approval
No ethical approval was sought for this study as it is a systematic review of existing evidence.

Author contribution
All of the aforementioned authors have contributed to the manuscript under all four of the following: 1. Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work 2. Drafting the work or revising it critically for important intellectual content 3. Final approval of the version to be published 4. Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.