Large Sample Sizes Alone May Not Save The Day

Our comment on the reliability issue recently came out in Nature Human Behaviour. It carries a warning for the science of individual differences: with big data, sample size is not the only thing that matters; measurement reliability matters as well, if not most of all.

Harnessing Reliability for Neuroscience Research

Xi-Nian Zuo*, Ting Xu, Michael Peter Milham

Neuroscientists are amassing the big data needed to study individual differences and identify biomarkers. However, measurement reliability within individual samples is often suboptimal, thereby requiring unnecessarily large samples. We comment on reliability in neuroimaging and provide examples of how the reliability can be increased.

The neuroimaging community has made significant strides towards collecting large-scale neuroimaging datasets, which – until the past decade – had seemed out of reach. Between initiatives focused on the aggregation and open sharing of previously collected datasets, and de novo data generation initiatives tasked with the creation of community resources, tens of thousands of datasets are now available online. These span a range of developmental statuses and disorders and many more will soon be available. Such open data are allowing researchers to increase the scale of their studies, apply various learning strategies (e.g., artificial intelligence) with ambitions of brain-based biomarker discovery, and address questions regarding the reproducibility of findings – all at a pace that is unprecedented in imaging. However, based on the findings of recent works (1-3), few of the datasets generated to date contain enough data per subject to achieve maximally reliable measures of brain connectivity. Although our examination of this critical deficiency focuses on the field of neuroimaging, the implications of our argument and the statistical principles discussed are broadly applicable.

Scoping the problem

Our concern is simple: researchers are working hard to amass large-scale datasets, whether through data sharing or coordinated data generation initiatives, but failing to optimize their data collections for relevant reliabilities (e.g., test-retest, inter-rater, etc.) (4). They may be collecting larger amounts of suboptimal data, rather than smaller amounts of higher-quality data – a tradeoff that does not bode well for the field, particularly when it comes to making inferences and predictions at the individual level. We believe that this misstep can be avoided by critical assessments of reliability upfront.

The tradeoff we observe occurring in neuroimaging reflects a general tendency in neuroscience. Statistical power is fundamental to studies of individual differences, as it determines our ability to detect effects of interest. While sample size is readily recognized as a key determinant of statistical power, measurement reliabilities are less commonly considered, and at best only indirectly considered when estimating required sample sizes. This is unfortunate, as statistical theory dictates that reliability places an upper limit on the maximum detectable effect size.

The interplay between reliability, sample size, and effect size in determinations of statistical power is commonly underappreciated in the field. To facilitate a more direct discussion of these factors, Figure 1 depicts the impact of measurement reliability and effect size on the sample sizes required to achieve desirable levels of statistical power (e.g., 80%); these relations are not heavily dependent on the specific form of statistical inference employed (e.g., two-sample t-test, paired t-tests, three-level ANOVA). Estimates were generated using the pwr package in R and are highly congruent with results from Monte Carlo simulations (5). Relevant to neuroscience, where the bulk of findings report effect sizes ranging from modest to moderate (6), the figure makes obvious our point that increasing reliability can dramatically reduce the sample size requirements (and therefore cost) for achieving statistically appropriate designs.
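The relation between reliability and required sample size can be sketched in a few lines. The following is a minimal illustration, not the authors' pwr-based R code. It rests on two stated assumptions: the classical attenuation relation, in which the observed standardized effect equals the true effect multiplied by the square root of the reliability, and a normal approximation to the two-sample t-test. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def attenuated_effect(true_d, reliability):
    """Observed Cohen's d under classical attenuation (assumption):
    measurement error inflates the within-group SD by 1/sqrt(reliability)."""
    return true_d * math.sqrt(reliability)


def n_per_group(true_d, reliability, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison, using the normal approximation to the t-test."""
    d_obs = attenuated_effect(true_d, reliability)
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / d_obs) ** 2)


if __name__ == "__main__":
    # Same true effect (d = 0.5), progressively less reliable measurement.
    for rel in (1.0, 0.8, 0.5, 0.25):
        print(f"reliability={rel:.2f}  n per group={n_per_group(0.5, rel)}")
```

Under these assumptions, the required sample size scales with the inverse of the reliability: halving the reliability roughly doubles the number of participants needed per group for the same true effect.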

In neuroimaging, the reliability of the measures employed in experiments can vary substantially (3,4). Focusing on MRI, morphological measures are known to have the highest reliability, with most voxels in the brain exhibiting reliabilities measured as intraclass correlation (ICC) > 0.8 for core measures (e.g., volume, cortical thickness and surface area). For functional MRI (fMRI) approaches, reliability tends to be lower and more variable, heavily dependent on the scan state, the nature of the measure employed and, most importantly, the amount of data obtained (e.g., for basic resting-state fMRI measures, the mean ICC obtained across voxels may increase by 2-4 times as one increases from 5 minutes to 30 minutes of data) (2,3). Limited inter-individual variability can be a significant contributor to findings of low reliability for fMRI, as its magnitude relative to within-subject variation is a primary determinant of reliability. Such a concern has been raised for task fMRI (7), which directly borrows behavioral task designs from the psychological literature (8).
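To make the role of inter-individual variability concrete, the sketch below computes a one-way random-effects ICC, often written ICC(1,1), from a small test-retest table. This is one common ICC form chosen for illustration; the works cited above may use other variants. The function name and the toy numbers are made up for the example.

```python
def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: list of rows, one row per subject, one column per session.
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of repeated measurements per subject
    subject_means = [sum(row) / k for row in scores]
    grand_mean = sum(subject_means) / n

    # Between-subject mean square: spread of subject means around the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: spread of sessions around each subject's mean.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, subject_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Toy data: identical within-subject noise, different between-subject spread.
wide_spread = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
narrow_spread = [[3.0, 4.0], [3.5, 4.5], [4.0, 5.0]]

if __name__ == "__main__":
    print(f"wide spread:   ICC = {icc_oneway(wide_spread):.3f}")
    print(f"narrow spread: ICC = {icc_oneway(narrow_spread):.3f}")
```

Both toy datasets have the same session-to-session noise, yet the one with limited between-subject spread yields a much lower ICC, which is the pattern described above for fMRI.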

Potential implications

From a statistical perspective, the risks of underpowered samples yielding increased false negatives and artificially inflated effect sizes (i.e., Winner’s Curse Bias) are well known. More recently, the potential for insufficiently powered samples to generate false positives has been established as well (9). All these phenomena reduce the reproducibility of findings across studies, a challenge that other fields (e.g., genetics) have long worked to overcome. In the context of neuroimaging or human brain mapping, an additional concern is that we may be biased to overvalue those brain areas for which measurement reliability is greater. For example, the default and fronto-parietal networks receive more attention in clinical and cognitive neuroscience studies of individual and group differences. This could be appropriate, but it could also reflect the higher reliabilities of these networks (3,4).


Our goal here is to draw greater attention to the need for assessment and optimization of reliability in neuroscience research, which is typically underappreciated. Whether one is focusing on imaging, electrophysiology, neuroinflammatory markers, microbiomics, cognitive neuroscience paradigms, or on-person devices, it is essential that we consider measurement reliability and its determinants.

For MRI-based neuroimaging, a repeated theme across the various modalities (e.g., diffusion, functional, morphometry) is that higher quality data require more time to collect, whether due to increased resolution or repetitions. As such, investigators would benefit from assessing the minimum data requirements to achieve adequately reliable measurements before moving forward. An increasing number of resources are available for such assessments of reliability (e.g., Consortium for Reliability and Reproducibility, MyConnectome Project, Healthy Brain Network Serial Scanning Initiative, Midnight Scan Club, Yale Test-Retest Dataset, PRIMatE Data Exchange). It is important to note that these resources are primarily focused on test-retest reliability (4), leaving other forms of reliability less explored (e.g., inter-state reliability, inter-scanner reliability; see recent efforts from a Research Topic on reliability and reproducibility in functional connectomics:

Importantly, reliability will differ depending on how a given imaging dataset is processed, and which brain features are selected. A myriad of different processing strategies and brain features have emerged, but they are rarely compared with one another to identify those most suitable for studying individual differences. In this regard, efforts to optimize analytic strategies for reliability are essential, as they make it possible to decrease the minimum data required per individual to achieve a target level of reliability (1-4,10). This is critically important for applications in developing, aging and clinical populations, where scanner environment tolerability limits our ability to collect time-intensive datasets. An excellent example of quantifying and optimizing for reliability comes from functional connectomics. Following convergent reports that at least 20-30 minutes of data are needed to obtain test-retest reliability for traditional pair-wise measures of connectivity (2), recent works have suggested the feasibility of combining different fMRI scans in a session (e.g., rest, movie, task) to make up the differential in calculating reliable measures of functional connectivity (2,11).

Cognitive and clinical neuroscientists should be aware that many cognitive paradigms used inside and outside of the scanner have never been subject to proper assessments of reliability, and the quality of reliability assessments for questionnaires (even proprietary) can vary substantially. As such, the reliability of the data being used on the phenotyping side is often an unknown in the equation and can limit the utility of even the most optimal imaging measures – a reality that also affects other fields (e.g., genetics) and inherently compromises such efforts. Although not always appealing, an increased focus on the quantification and publication of minimum data requirements and their reliabilities for phenotypic assessments is a necessity, as is the exploration of novel approaches to data capture that may increase reliability (e.g., sensor-based acquisition via wearables and longitudinal sampling via smartphone apps).

Finally, and perhaps most critically, there is marked diversity in how the word reliability is used and a growing number of separate reliability metrics are appearing. This phenomenon is acknowledged in a recent publication (12) by an Organization for Human Brain Mapping workgroup tasked with generating standards for improving reproducibility. We suggest it would be best to build directly on the terminology and measures well established in other literatures (e.g., statistics, medicine) rather than start anew (13). Doing so would help avoid confusions in terminology, particularly between reliability and validity, two related though distinct concepts that are commonly used interchangeably in the literature. To facilitate an understanding of this latter point, we include a statistical note on the topic below.

A confusion to avoid

It is crucial that researchers acknowledge the gap between reliability and validity, as a highly reliable measure can be driven by artifact rather than meaningful (i.e., valid) signal. As illustrated in Figure 2, this point becomes obvious when one considers the differing sources of variance associated with the measurement of individual differences (14). First, we have the portion of the variance measured across individuals that is the trait of interest (Vt) (e.g., between-subject differences in gray matter volume within left inferior frontal gyrus). Second is the variance related to unwanted contaminants in our measurement that can systematically vary across individuals (Vc) (e.g., between-subject differences in head motion). Finally, there is random noise (Vr), which is commonly treated as the within-subject variation. Reliability is the proportion of the total variance that can be attributed to systematic variance across individuals (including both Vt and Vc) (see Eq 1); in contrast, validity is the proportion of the total variance that can be attributed specifically to the trait of interest alone (Vt) (see Eq 2).

Reliability = (Vt + Vc) / (Vt + Vc + Vr)        (1)

Validity = Vt / (Vt + Vc + Vr)         (2)

As discussed in prior works (14), this framework indicates that a measure cannot be more valid than reliable (i.e., reliability provides an upper bound for validity). So, while it is possible to have a measurement that is sufficiently reliable and completely invalid (e.g., a reliable artifact), it is impossible to have a measurement with low reliability that has high validity.
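Equations 1 and 2 translate directly into code. The short sketch below uses made-up variance components (they are illustrative, not estimates from any dataset) to show a reliable artifact and to check that validity never exceeds reliability.

```python
def reliability(vt, vc, vr):
    """Eq 1: share of total variance that is systematic across individuals."""
    return (vt + vc) / (vt + vc + vr)


def validity(vt, vc, vr):
    """Eq 2: share of total variance due to the trait of interest alone."""
    return vt / (vt + vc + vr)


# Hypothetical variance components (Vt, Vc, Vr), chosen only for illustration.
cases = {
    "reliable artifact": (0.0, 8.0, 2.0),  # no trait signal, stable contaminant
    "mixed signal": (6.0, 2.0, 2.0),       # trait plus some contaminant
    "noisy measure": (1.0, 0.0, 9.0),      # trait swamped by random noise
}

if __name__ == "__main__":
    for name, (vt, vc, vr) in cases.items():
        rel = reliability(vt, vc, vr)
        val = validity(vt, vc, vr)
        # Because Vc >= 0, validity can never exceed reliability.
        assert val <= rel
        print(f"{name}: reliability={rel:.2f}, validity={val:.2f}")
```

The first case has a reliability of 0.8 and a validity of 0, matching the "reliable artifact" scenario described above; the last case shows that low reliability caps validity however meaningful the underlying trait is.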

A specific challenge for neuroscientists is that while reliability can be readily quantified, validity cannot, as Vt cannot be directly measured. As such, various indirect forms of validity are used, which differ in the strength of the evidence required. At one end is criterion validity, which compares the measure of interest to an independent measure designated as the criterion or “gold standard” measurement (e.g., comparison of individual differences in tracts identified by diffusion imaging to postmortem histological findings, or in fMRI-based connectivity patterns to intracranial measures of neural coupling or magnetoencephalography). At the other extreme is face validity, in which findings are simply consistent with “common sense” expectations (e.g., does my functional connectivity pattern look like the motor system?). Intermediate to these are concepts such as construct validity, which tests whether a measure varies as would be expected if it is indexing the desired construct (i.e., convergent validity) and not others (i.e., divergent validity) (e.g., do differences in connectivity among individuals vary with developmental status and not head motion or other systematic artifacts?). An increasingly common tool in the imaging community is predictive validity, where researchers test the ability to make predictions regarding a construct of interest (e.g., do differences in the network postulated to support intelligence predict differences in IQ?). As can be seen from the examples provided, different experimental paradigms offer differing levels of validity, with the more complex and challenging offering the highest forms. From a practical perspective, what researchers can do is make best efforts to measure and remove artifact signals such as head motion (4,15), and work to establish the highest form of validity possible using the methods available.

Closing remarks

As neuroscience makes strides in efforts to deliver clinically useful tools, it is essential that assessments and optimizations for reliability become common practice. This will require improved research practices among investigators, as well as support from funding agencies in the generation of open community resources upon which these essential properties can be quantified.

Code availability

All code employed in this effort can be found on GitHub.


  1. Laumann, T.O. et al. Neuron 87, 657-670 (2015).
  2. O’Connor, D. et al. Gigascience 6, 1-14 (2017).
  3. Xu, T., Opitz, A., Craddock, C., Zuo, X.N. & Milham, M.P. Cereb. Cortex 26, 4192-4211 (2016).
  4. Zuo, X.N. & Xing, X.X. Neurosci. Biobehav. Rev. 45, 110-118 (2014).
  5. Kanyongo, G.Y., Brook, G.P., Kyei-Blankson, L. & Gocmen, G. J. Mod. Appl. Stat. Methods 6, 81-90 (2007).
  6. Poldrack, R.A. et al. Nat. Rev. Neurosci. 18, 115-126 (2017).
  7. Bennett, C.M. & Miller, M.B. Ann. N.Y. Acad. Sci. 1191, 133-155 (2010).
  8. Hedge, C., Powell, G. & Sumner, P. Behav. Res. Methods 50, 1166-1186 (2018).
  9. Button, K.S. et al. Nat. Rev. Neurosci. 14, 365-376 (2013).
  10. Tomasi, D.G., Shokri-Kojori, E. & Volkow, N.D. Cereb. Cortex 27, 4153-4165 (2017).
  11. Elliott, M.L. et al. Neuroimage 189, 516-532 (2019).
  12. Nichols, T.E. et al. Nat. Neurosci. 20, 299-303 (2017).
  13. Koo, T.K. & Li, M.Y. J. Chiropr. Med. 15, 155-163 (2016).
  14. Kraemer, H.C. Annu. Rev. Clin. Psychol. 10, 111-130 (2014).
  15. Yan, C.G. et al. Neuroimage 76, 183-201 (2013).
