Challenging the dogma of statistical significance in the interest of better decision-making

April 1, 2019 | Academia

Statistical inference is the perilous but necessary exercise of drawing conclusions from (sometimes very) limited datasets, usually with a view to facilitating some sort of decision – whether to go forward with a particular line of R&D, or whether to approve a drug for a given indication.

In a fantasy world of unlimited resources and omniscient decision-makers, statistical inference would be a futile exercise. Since this is not the case, many bright minds have dedicated significant efforts to developing methodologies that allow us to draw reasonable conclusions from limited datasets. Much of the scientific enterprise rests on our ability to test hypotheses against datasets via generally accepted statistical methods. The emergence of such methods through the interplay between some of classical statistics’ major contributors is summarized in this book review. Among other things, it shows that Fisher’s own thinking around ‘statistical significance’ evolved over time, with the late Fisher emphasizing that “no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas”.

Ronald Fisher would probably be dismayed to find out that the notion of ‘statistical significance’ he introduced has turned into precisely that which it was not meant to be, namely a means of over-simplifying statistical findings that keeps leading laymen and even scientists astray. In a March 20 Nature piece, Valentin Amrhein, Sander Greenland and Blake McShane produce a cogent indictment of the uncritical use of ‘statistical significance’ as the be-all and end-all of hypothesis testing in academia and beyond. Their call for an end to the concept of ‘statistical significance’ is at once a call for the nuanced interpretation which statistical findings require by their very nature. For instance, sample size needs to be adequately large relative to the overall population under study in order to minimize the impact of underlying heterogeneity; this is typically reflected in confidence intervals narrowing as sample size increases.
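To make the point concrete, here is a minimal sketch in Python (with purely illustrative numbers, assuming a normally distributed outcome with unit standard deviation) of how the 95% confidence interval around a sample mean narrows as the sample grows:

```python
# Minimal sketch: the width of a 95% confidence interval for a sample mean
# shrinks roughly as 1/sqrt(n). The effect size (0.3) and sample sizes below
# are purely illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (20, 80, 320, 1280):
    sample = rng.normal(loc=0.3, scale=1.0, size=n)   # hypothetical true effect of 0.3
    mean = sample.mean()
    sem = stats.sem(sample)                            # standard error of the mean
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
    print(f"n={n:5d}  mean={mean:+.3f}  95% CI=({lo:+.3f}, {hi:+.3f})  width={hi - lo:.3f}")
```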

A major point of contention for Amrhein et al. is that researchers often interpret similar datasets (based on point estimates of effect size) as either reflecting a genuine effect or no effect at all, solely on the basis of the p-value obtained for a given sample. Source: V. Amrhein et al.
The authors back up their critique of such reductionist logic with an analysis of a sizable (n=791) sample of academic publications showing that around half of these erroneously equate ‘non-significance’ with ‘no effect’. Source: V. Amrhein et al.
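The following toy simulation, using hypothetical effect and sample sizes, shows how easily two identically designed studies of the same modest, genuine effect can land on opposite sides of the p=0.05 line:

```python
# Minimal sketch of the "same reality, opposite conclusions" trap: two trials
# drawn from the *same* modest true effect can easily land on opposite sides
# of p = 0.05. Effect size and per-arm sample size here are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n = 0.3, 50          # modest true effect, small per-arm sample
pairs, disagreements = 1000, 0

for _ in range(pairs):
    p_values = []
    for _ in range(2):             # two independent replications of the same study
        treated = rng.normal(true_effect, 1.0, n)
        control = rng.normal(0.0, 1.0, n)
        p_values.append(stats.ttest_ind(treated, control).pvalue)
    if (p_values[0] < 0.05) != (p_values[1] < 0.05):
        disagreements += 1

print(f"Pairs of identical studies disagreeing on 'significance': {disagreements / pairs:.0%}")
```

With these assumptions, a substantial fraction of such study pairs disagree on ‘significance’ despite sampling exactly the same underlying reality.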

In the life sciences, go / no go decisions are crucial to the efficient allocation of scarce resources, and at the end of the road, the regulator also needs to be presented with ‘substantial evidence’ of a drug’s safety and efficacy for the target population. However, we would be kidding ourselves if we postulated that a single statistical parameter suffices to inform such go / no go decisions, or to qualify evidence as ‘substantial’ in the eyes of the regulator. Traditionally, the Food and Drug Administration (FDA) required at least two adequately powered, randomized studies for drug approval, but in many cases today the agency is willing to grant approval on the basis of a single such study, provided it yields statistically significant outcomes. There are certainly instances in which it is appropriate to require only a single randomized study, or even to grant approval on the basis of ‘surrogate endpoints’. As far as non-orphan indications are concerned, however, it would seem more clinically meaningful for a sponsor to produce, say, three large studies showing meaningful effect sizes and reasonably narrow confidence intervals, even if those studies yield p-values of variable significance, rather than to ‘luck out’ on a single study which happens to yield p<0.05.
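As a rough, purely illustrative sketch of why that matters, consider how often a drug with no true effect at all clears a single p<0.05 hurdle versus three independent ones:

```python
# Rough sketch: a truly ineffective drug clears one p < 0.05 trial about 5% of
# the time, but rarely clears three independent trials. All numbers below are
# illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_per_arm, n_simulations = 100, 20_000

def trial_significant():
    treated = rng.normal(0.0, 1.0, n_per_arm)   # drug with zero true effect
    control = rng.normal(0.0, 1.0, n_per_arm)
    return stats.ttest_ind(treated, control).pvalue < 0.05

single = sum(trial_significant() for _ in range(n_simulations)) / n_simulations
triple = sum(all(trial_significant() for _ in range(3))
             for _ in range(n_simulations)) / n_simulations

print(f"Null drug passing one trial:    {single:.1%}")   # ~5% by construction
print(f"Null drug passing three trials: {triple:.3%}")   # ~0.0125% in theory (0.05**3)
```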

However, given that ‘statistical significance’ has reigned supreme for quite some time both in the literature and with the regulator, should anyone be surprised that a superficial reading of clinical data still dominates the news cycle or that investors tend to over-rely on the magical statistical threshold known as p<0.05?

In his famous triptych, Hieronymus Bosch illustrated, sequentially, the Garden of Eden or Paradise, followed by ‘the garden of earthly delights’ and finally a hellish landscape. Each panel foreshadows the next by seeding elements of mortality, evil, or sin, leading to the deterioration of the human condition. Awareness of ‘original sin’ should guard us against indulging in behavior which potentiates our own failings, or in the case of statistics, the inadequacy of ‘p<0.05’ as an overarching measure. Source: Museo del Prado, Madrid, Spain

One key section of Amrhein et al.’s paper is titled Quit categorizing, which in my estimation really summarizes the whole conundrum. The trouble started with Fisher proposing, and his contemporaries adopting, an arbitrary threshold (0.05) below which results should be dubbed ‘statistically significant’. As we highlighted earlier, the ‘late’ Fisher himself emphasized the nature of the datasets under consideration and the goals of a particular line of scientific inquiry over a static ‘stat sig’ threshold, perhaps in recognition of the dangers inherent in ‘<0.05’ becoming dogma.

In my occupation as an investor in the life sciences, I am frequently confronted with what might appropriately be labeled ‘stat sig fundamentalism’, which, not unlike religious fundamentalism, continues to draw an intellectually lazy & largely ignorant crowd with unrealistic promises of easy solutions to complex problems. It is also apparent that a number of drug development outfits somewhat maliciously focus their efforts on producing ‘stat sig’ results at all costs, knowing of course that such results are highly PR-worthy and will be propagated & soaked up by the trading masses.

In a December 2018 report, I indicted the misleading use of ‘stat sig’ against small statistical samples, in particular in the context of ‘PhII’ trials which cannot support regulatory approval. To help wean investors off their reliance on p-values as a one-size-fits-all measure of relevance, I suggested they ask themselves the following:

  • How representative is a given clinical trial cohort of the overall patient population?
  • What is the natural history of the condition and what is the efficacy of existing therapies?
  • How informative are measures of statistical significance against a given dataset?

I then went on to explain that a representative sample size may vary materially depending on the prevalence & etiology of a given condition, citing two contrasting examples: 1) a rare condition with invariably poor outcomes and low susceptibility to existing treatments; and 2) a widespread condition with variable outcomes and some susceptibility to existing treatments.
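A back-of-the-envelope power calculation, using hypothetical response rates chosen to caricature these two scenarios, shows how differently the sample size requirements shake out:

```python
# Back-of-the-envelope sketch: per-arm sample size needed to detect a difference
# in response rates with 80% power at two-sided alpha = 0.05, via the normal
# approximation. The two scenarios and their response rates are hypothetical.
from scipy.stats import norm

def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (z_a + z_b) ** 2 * variance / (p_control - p_treated) ** 2

# 1) rare, intractable condition: ~5% improve under natural history, ~50% on drug
# 2) widespread condition: ~40% respond on standard of care, ~50% on drug
print(f"Rare disease scenario:   ~{n_per_arm(0.05, 0.50):.0f} patients per arm")
print(f"Common disease scenario: ~{n_per_arm(0.40, 0.50):.0f} patients per arm")
```

Roughly a dozen patients per arm would suffice in the first scenario versus several hundred in the second, which is one reason a one-size-fits-all reading of ‘p<0.05’ serves these settings so poorly.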

A small sample showing therapeutic benefit from an investigational drug, yet lacking placebo control and formal hypothesis testing, may be adequate to reasonably suggest significant therapeutic benefit in the context of rare, practically intractable diseases. In such instances, given the difficulties – both practical and ethical – involved in enrolling large, placebo-controlled trials with a view to generating ‘stat sig’ outcomes, it is appropriate to consider the typical clinical trajectory of patients with the rare condition and to evaluate ‘numerical’ outcomes on study drug against this expected trajectory in the absence of a new therapeutic intervention. This in turn necessitates a deep understanding of the condition’s ‘natural history’, which must almost invariably draw on feedback from the trifecta of healthcare stakeholders – patients, caregivers (typically family members) and physicians.

In recognition of this important consideration for rare disease drug development, the FDA recently issued draft guidance on natural history studies. While emphasizing proper planning & methodology, this draft guidance recognizes the ‘sociological’ aspect of developing therapies for rare diseases by emphasizing that “consideration should be given to enlisting the help of disease-specific support groups or patient advocacy groups because they are invaluable resources for identifying and helping to recruit patients. They also can contribute to study design and execution because of their unique perspectives”. Furthermore, the FDA encourages the conduct of natural history studies with a view to improving outcomes for patients independently of a given investigational therapy’s odds of approval: “The benefits of planning, organizing, and implementing a natural history study may go beyond drug development. A natural history study may benefit patients with rare diseases by establishing communication pathways, identifying disease-specific centers of excellence, facilitating the understanding and evaluation of the current standard of care practices, and identifying ways to improve patient care. A natural history study may provide demographic data and epidemiologic estimates of the prevalence of the disease and disease characteristics and aid disease tracking.”
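As a minimal sketch of what such an evaluation could look like in practice, one might compare responses in a small, uncontrolled cohort against a natural-history-derived benchmark rate; the benchmark, cohort size and responder count below are entirely hypothetical:

```python
# Minimal sketch: evaluating a small, uncontrolled rare-disease cohort against a
# natural-history benchmark rather than a concurrent placebo arm. The benchmark
# rate, cohort size, and responder count are hypothetical assumptions.
from scipy.stats import binomtest

natural_history_rate = 0.05   # spontaneous improvement is essentially never seen
responders, cohort_size = 6, 12

result = binomtest(responders, cohort_size, natural_history_rate, alternative="greater")
ci = result.proportion_ci(confidence_level=0.95)

print(f"Observed response rate: {responders / cohort_size:.0%}")
print(f"Exact binomial p-value vs {natural_history_rate:.0%} benchmark: {result.pvalue:.2g}")
print(f"95% CI for true response rate: ({ci.low:.0%}, {ci.high:.0%})")
```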

It should become immediately apparent that a collaborative approach – both between industry and the public sector & civil society, and between actors within the private sector – is in the best interest of patients and industry alike, as it facilitates the approval of innovative therapies for the ‘right’ patients on the basis of a comprehensive understanding of natural history, which in turn enables appropriate evaluation criteria that go beyond ‘p<0.05’. A deeper exploration of the rationale and promise of public-private and private-private partnerships in the pharmaceutical space merits a standalone examination, which could be the subject of future blog posts. For now, let us return to a more general consideration of possible ways forward beyond the dogma of ‘stat sig’.

The primary contentions advanced by Amrhein et al. are not new, and neither are the solutions they propose, which is at once indicative of the persistence of the problem and of a certain consistency in the thinking of those seeking to supersede ‘stat sig’ with more comprehensive evaluation. A much-cited 2014 piece by Regina Nuzzo titled Scientific method: Statistical errors, also published in Nature, already laid out much of what is wrong with our over-reliance on ‘stat sig’: ‘stat sig’ in a single, small sample does not necessarily translate into reproducibility, does not indicate a meaningful effect size, and incentivizes unscrupulous actors to engage in ‘p-hacking’. The author recognized that “any reform would need to sweep through an entrenched culture. It would have to change how statistics is taught, how data analysis is done and how results are reported and interpreted”, which is of course what Amrhein et al. are attempting to induce by collecting a significant number of endorsements for their proposals.
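One flavor of ‘p-hacking’ is easy to simulate: measure several endpoints on a drug with no true effect and report whichever happens to cross p<0.05. The numbers below are hypothetical, but the resulting inflation of the false-positive rate is not:

```python
# Rough sketch of one flavor of 'p-hacking': test several endpoints on the same
# null data and report whichever crosses p < 0.05. The per-endpoint error rate
# stays at 5%, but the chance of reporting *something* significant balloons.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, endpoints, simulations = 50, 10, 5_000
hits = 0

for _ in range(simulations):
    # drug with no true effect on any of the measured endpoints
    treated = rng.normal(0.0, 1.0, (endpoints, n))
    control = rng.normal(0.0, 1.0, (endpoints, n))
    p_values = stats.ttest_ind(treated, control, axis=1).pvalue
    if (p_values < 0.05).any():
        hits += 1

print(f"Null-drug studies reporting at least one 'significant' endpoint: {hits / simulations:.0%}")
# With 10 independent endpoints, roughly 1 - 0.95**10 ≈ 40%, despite zero true effect.
```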

By accepting that the ‘solution’ to a deeply rooted problem caused by the worship of a single statistical measure cannot be an equally unidimensional alternative, but must instead draw on complementary statistical methods, confirmation via reproduction, and due consideration of the broader context, we will inevitably re-discover a sense of agency and the real-world consequences of human judgment. In the context of healthcare R&D and investment, coming to terms with the fact that statistical analysis is but a tool, and not a substitute for human judgment, should strengthen our resolve to do what is right and necessary to drive the only outcome worth producing – namely, a meaningful, positive impact on the lives of patients.