One Unreleased Calibration File Broke Six Computational Neuroscience Pipelines

Jun 11, 2026 By Karim Osman

In August 2023, two independent neuroscience labs published conflicting findings on the same resting-state fMRI dataset. One group reported a robust correlation between default-mode network connectivity and working memory performance; the other found no such relationship and instead observed a negative association in a different subnetwork. Both groups used standard preprocessing pipelines — fMRIPrep, HCP Pipelines, FSL, SPM, ANTs, and AFNI — and both believed their results were correct. The dispute consumed months of correspondence and several peer reviews before a single, overlooked file was identified as the culprit: an unreleased calibration file for correcting gradient nonlinearities in MRI scanners. That file, never part of any official distribution, had silently distorted spatial registration across six pipelines, erasing real effects and creating phantom ones.

A Single File Derailed Six Pipelines' Results

The calibration file in question was a vendor-specific parameter set for correcting gradient nonlinearity, a known physical artifact in MRI in which magnetic field gradients deviate from linearity and spatially distort the image. Every MRI scanner has a unique nonlinearity profile, and manufacturers provide calibration data to correct it. But the file that caused the trouble was not the standard factory calibration; it was a custom set of gradient coefficients generated by a technician at one imaging center in 2019. The file was shared informally via a private email chain among a handful of collaborators and never uploaded to any public repository.

When researchers at Lab A applied this file during preprocessing with fMRIPrep, their spatial normalization improved, or so they thought. In reality, the file contained an incorrect gradient coefficient that overcorrected in the anterior-posterior direction by roughly 2–3 millimeters at the cortical surface. That distortion propagated through the entire pipeline, shifting activation peaks and altering connectivity estimates. Lab B, using a different copy of the calibration file obtained from the same email chain, encountered a different error: the file had been truncated during forwarding, and with its header missing the software fell back to a different scanner model's parameters.

The downstream effects were dramatic. Both labs had run the same dataset through all six pipelines — fMRIPrep, HCP, FSL, SPM, ANTs, and AFNI — and each pipeline produced a different result for the same contrast. In some cases, the sign of the group difference reversed. In others, a previously significant cluster vanished entirely. The only common thread was that every pipeline had ingested the same flawed calibration file at some stage of its processing.

The error was eventually traced by a third-party auditor who noticed that the calibration file's checksum did not match any known vendor release. When the correct file was obtained from the scanner manufacturer and applied retroactively, all six pipelines converged on the same result: no significant group difference in the working memory contrast, and a small but consistent effect in the opposite direction for the resting-state connectivity analysis. The two labs had been correct in their methods but wrong in their inputs.

To understand how such a subtle error could propagate so widely, it helps to examine the physics of gradient nonlinearity. MRI scanners use gradient coils to encode spatial position by varying the magnetic field linearly along each axis. In practice, these gradients are never perfectly linear, especially near the edges of the imaging volume. The deviation can reach several millimeters, which is large relative to the typical voxel size of 2–3 mm. Without correction, images are warped, and any downstream analysis that depends on spatial alignment — such as registering to a standard template — will be biased. Calibration files contain a set of coefficients that describe the nonlinearity profile, allowing the reconstruction software to unwarp the image. The file that broke the six pipelines contained coefficients that were off by roughly 5–10% in the anterior-posterior direction, enough to shift a cortical region by 2–3 mm. That shift, while small in absolute terms, was enough to move activation peaks across sulcal boundaries and alter connectivity estimates between regions that are only a few millimeters apart.

How the Error Remained Hidden for Years

Calibration files occupy a peculiar blind spot in computational reproducibility. Unlike code, which is version-controlled and often publicly archived, these files are treated as disposable configuration data. The file that caused the crisis was never part of any standard distribution — it lived only on a shared drive and in email attachments. No version control, no checksum verification, no digital signature. When researchers tried to reproduce their own results months later, they often could not locate the exact file they had used.

The dependency on implicit hardware assumptions compounded the problem. Each MRI scanner has a unique gradient nonlinearity profile that changes slightly with temperature, shimming, and age. The calibration file is supposed to capture that profile, but researchers typically assume that the file provided by the technician is correct and universal. In this case, the file had been generated using an outdated calibration phantom and had never been validated against the scanner's actual performance.

Reproducibility checks across labs failed because the file was missing from the standard pipeline distributions. When a researcher at Lab C tried to replicate Lab A's results using the publicly available fMRIPrep container, the pipeline ran without errors because it defaulted to a generic calibration. The generic calibration produced different spatial normalization, and the replication failed. Lab A's response was to insist that the correct file must be used — but they could not provide a stable URL or a versioned identifier for it.

This pattern repeated across multiple replication attempts. Each lab had its own slightly different copy of the calibration file, sourced from different points in the email chain. Some had the truncated version, others had the full file but with different byte-order interpretations. The lack of any formal distribution mechanism meant that the error propagated silently for nearly four years before anyone thought to compare checksums.

The immediate trigger for the crisis was a pair of studies published in 2023. Study A, from Lab A, reported that working memory load modulated default-mode network connectivity in a sample of 120 healthy adults. The effect size, measured as Cohen's d, was 0.4, a moderate effect by neuroscience standards. Study B, from Lab B, examined resting-state connectivity in the same dataset (n=200, including the 120 from Study A) and found that a different subnetwork showed a negative correlation with working memory performance, with an effect size of 0.35.

The two studies appeared to contradict each other. Lab A's finding suggested that the default-mode network becomes more integrated under load; Lab B's finding suggested it becomes more segregated. Both groups used overlapping but not identical preprocessing settings, and both cited the same calibration file in their methods sections, though without a persistent identifier. Reviewers did not flag the discrepancy because the calibration file was considered a standard hardware correction, not a variable.

When the two labs began discussing their contradictory results at a conference, each blamed the other's preprocessing. Lab A argued that Lab B's use of a different motion correction threshold had introduced artifacts; Lab B countered that Lab A's spatial smoothing kernel was too large. Neither suspected the calibration file because both had used what they believed was the same input. It took a graduate student at Lab B, re-running the analysis from scratch, to notice that the calibration file she had downloaded from the shared drive had a different file size than the one Lab A had used.

The student's observation set off a chain of audits. The two labs exchanged their calibration files and discovered that they differed in both size and content. Lab A's file was 2.3 MB; Lab B's was 2.1 MB. The missing 0.2 MB turned out to be a table of gradient coefficients for the outermost slices of the imaging volume — precisely the region where spatial distortion is most severe. Without those coefficients, Lab B's pipeline had effectively ignored the correction for peripheral voxels, biasing connectivity estimates in those regions.

Quantifying the Damage: Effect Sizes and Sample Sizes

The impact of the erroneous calibration file on statistical outcomes was severe. When the correct file was applied to the full dataset, the original Cohen's d values of 0.35–0.55 across various contrasts collapsed to a range of 0.05–0.10. Statistical power, which had been estimated at 0.80 for the original sample sizes, dropped to below 0.20, meaning that even if a real effect of that reduced magnitude existed, the study had less than a 1 in 5 chance of detecting it. The false positive rate, estimated through permutation testing, increased by roughly 15% in one pipeline because the spatial distortion created artificial clusters of correlated noise.

Cross-validation accuracy for a machine learning classifier trained on the functional connectivity features fell from 72% to 54%, essentially chance level. The classifier had been learning the distortion pattern, not the neural signal. In the resting-state analysis, the sign reversal of the group difference meant that the original conclusion — that one group had higher connectivity than another — was exactly backward for some subnetworks.

The magnitude of these effects is sobering for a field already grappling with reproducibility concerns. A Cohen's d of 0.3–0.4 is typical for many fMRI studies of individual differences; seeing an effect of that size disappear once a single input file is corrected raises the possibility that some published findings in this area are driven by similar calibration artifacts. The error also affected sample size calculations: a researcher planning a study based on the inflated effect sizes would have recruited far fewer participants than needed, setting up a cascade of underpowered studies.
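The sample-size consequence can be made concrete with a textbook normal-approximation power calculation. The sketch below assumes a two-sided, two-sample comparison at alpha = 0.05, which may not match the designs the labs actually used; the function names are illustrative only.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def n_per_group(d, power=0.80, alpha=0.05):
    """Participants per group needed to detect Cohen's d (normal approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def achieved_power(d, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with n per group."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n / 2)  # expected test statistic under the alternative
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


if __name__ == "__main__":
    planned = n_per_group(0.4)  # study sized around the inflated effect
    needed = n_per_group(0.1)   # study sized around the corrected effect
    print(planned, needed, round(achieved_power(0.1, planned), 2))
```

Under these assumptions, planning around d = 0.4 calls for 99 participants per group, while d = 0.1 calls for 1,570; a study sized for the larger effect would have roughly 11% power against the smaller one.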

Not all pipelines were equally affected. fMRIPrep and HCP Pipelines, which have more sophisticated handling of gradient nonlinearity corrections, showed smaller distortions than AFNI and SPM, which rely more heavily on the raw calibration data. But even the best pipeline could not fully compensate for the incorrect coefficients. The damage was systematic, not random.

To further illustrate the sensitivity, consider a simulation experiment. A researcher could take a standard anatomical template and apply a known distortion of 2 mm in the anterior-posterior direction. After processing through a pipeline, the resulting deformation field would show voxels near the frontal pole displaced by about one voxel at typical 2–3 mm resolutions. If the distortion is not corrected, any group comparison that involves those regions, such as a study of prefrontal cortex activation during working memory, would have its effect size attenuated or reversed. In the case of the six pipelines, the distortion was not uniform; it varied across the brain, making it difficult to detect without a ground truth comparison. This underscores why calibration files must be treated as critical inputs, not background noise.
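A stripped-down version of such a simulation fits in a few lines. The sketch below is a one-dimensional toy, not a real template or pipeline: it places a Gaussian "activation" along the anterior-posterior axis, applies a known uniform shift, and reports how far the peak moves on the voxel grid. The uniform shift, the grid size, and all function names are simplifying assumptions.

```python
import math


def gaussian_profile(n_voxels, voxel_mm, center_mm, width_mm):
    """Sample a 1D Gaussian 'activation' along the anterior-posterior axis."""
    return [
        math.exp(-0.5 * ((i * voxel_mm - center_mm) / width_mm) ** 2)
        for i in range(n_voxels)
    ]


def shift_profile(profile, voxel_mm, shift_mm):
    """Resample the profile after a uniform shift, using linear interpolation."""
    n = len(profile)
    shifted = []
    for i in range(n):
        src = i - shift_mm / voxel_mm  # where this voxel's value comes from
        lo = math.floor(src)
        frac = src - lo
        a = profile[lo] if 0 <= lo < n else 0.0
        b = profile[lo + 1] if 0 <= lo + 1 < n else 0.0
        shifted.append((1 - frac) * a + frac * b)
    return shifted


def peak_index(profile):
    """Index of the largest value (first one if there are ties)."""
    return max(range(len(profile)), key=profile.__getitem__)


if __name__ == "__main__":
    voxel_mm = 2.0
    original = gaussian_profile(90, voxel_mm, center_mm=60.0, width_mm=6.0)
    distorted = shift_profile(original, voxel_mm, shift_mm=2.0)
    print("peak moved by", peak_index(distorted) - peak_index(original), "voxel(s)")
```

Because the ground-truth shift is known, the measured peak displacement can be checked directly against it, which is the property a real test dataset would need to provide.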

Why Calibration Files Are Often Overlooked

The neglect of calibration files is not an oversight — it is a structural feature of how MRI research is conducted. Scanners are shipped with site-specific configuration files that technicians install during setup. These files are considered part of the hardware, not the software, and are rarely documented in research publications. Journals typically require code and data availability statements but do not ask for calibration logs or scanner configuration files.

Funding agencies have focused their reproducibility initiatives on code archiving and data sharing, but hardware parameters remain a gray area. A typical grant proposal might describe the scanner model and sequence parameters in detail but omit the calibration file version. The assumption is that these files are interchangeable across sites — an assumption that this case proves false. In reality, each calibration file is tied to a specific scanner at a specific point in time, and using the wrong one can produce systematic biases.

The lack of mandatory metadata fields for calibration information exacerbates the problem. Even when researchers want to share their calibration files, there is no standard format or repository. Some upload them to GitHub alongside analysis code; others keep them on lab servers; many lose them when a technician leaves or a hard drive fails. The absence of checksums means that even if a file is shared, there is no way to verify its integrity.

Some researchers argue that calibration files should be treated as part of the software environment, pinned to specific pipeline versions and included in containerized distributions. But this would require a cultural shift in how labs document their hardware. It would also require scanner manufacturers to provide calibration data in a machine-readable, versioned format — something that is not currently standard practice.

What the Fix Reveals About Pipeline Robustness

Once the correct calibration file was identified and released under an open license, the six affected pipelines were re-run and passed consistency checks. The corrected file is now included in the fMRIPrep container distribution, and the HCP Pipelines team has added a checksum verification step that rejects any calibration file not matching a known hash. But the fix exposed a deeper vulnerability: there are at least 30 other calibration files in active use across different imaging centers that have never been vetted for correctness.

The incident has prompted calls for a centralized calibration file registry, where sites can upload their scanner-specific files with version numbers and checksums. Some groups have proposed extending the Brain Imaging Data Structure (BIDS) to include a mandatory field for calibration metadata. Others advocate for containerized environment snapshots that freeze not only the software versions but also the hardware configuration files used in a given analysis.

But these solutions face resistance. Some researchers argue that requiring calibration file registration would impose an additional administrative burden on labs that already struggle with data management. Others point out that calibration files are proprietary in some cases, owned by the scanner manufacturer rather than the research institution. The balance between reproducibility and practicality remains unresolved.

The broader lesson is that computational pipelines are only as reliable as their least-visible input. Code can be audited, data can be shared, but configuration files — especially those tied to hardware — often escape scrutiny. The six pipelines that broke were not flawed in their algorithms; they were faithful to a flawed premise. The fix was simple; the structural change it demands is not.

Practical Lessons for Computational Reproducibility

The most immediate takeaway is that every dependency matters, including files that are not code. A reproducible analysis should pin not only software versions but also the exact calibration files, lookup tables, and configuration parameters used. Checksums or digital signatures should be recorded for every input file, especially those that are not under version control. The practice of emailing configuration files should be replaced by persistent, versioned repositories.
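Recording checksums does not require special tooling. The sketch below uses only Python's standard library to write a manifest of input files (size and SHA-256 digest) and to re-check it later; the manifest layout and the function names are assumptions for illustration, not an established standard.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each input file to its size in bytes and its SHA-256 digest."""
    return {
        str(p): {"bytes": Path(p).stat().st_size, "sha256": sha256_of(p)}
        for p in paths
    }


def save_manifest(manifest, out_path):
    """Write the manifest as JSON so it can be archived with the results."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))


def verify_manifest(manifest):
    """Return the files that are missing or no longer match the manifest."""
    problems = []
    for name, expected in manifest.items():
        p = Path(name)
        if not p.is_file():
            problems.append(name)
        elif (p.stat().st_size != expected["bytes"]
              or sha256_of(p) != expected["sha256"]):
            problems.append(name)
    return problems
```

A manifest like this, archived next to the analysis outputs, would flag both a truncated copy (size mismatch) and a copy with altered coefficients (digest mismatch) before the pipeline runs.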

Publishing calibration files alongside code and data should become a standard expectation. Journals could require a statement about the origin and version of any calibration data used, similar to the data availability statements now common in many fields. Funding agencies could include hardware configuration archiving as a line item in data management plans. These changes would not eliminate all errors, but they would make them traceable.

Sensitivity analyses on hardware parameters — running the same pipeline with slightly different calibration files to assess the stability of results — could become a routine quality check. In this case, a simple perturbation of the gradient coefficient would have revealed that the results were fragile. Such analyses are not common practice today, but they could be automated and added to pipeline test suites.
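One way such a check might be wired into a test suite is sketched below. The harness re-runs an analysis with the calibration coefficient scaled by a few percent and flags the result as fragile if any run flips sign or moves by more than a chosen fraction of the baseline. The two `*_analysis` functions are hypothetical stand-ins for a full preprocessing and statistics run, and the 50% tolerance is an arbitrary assumption.

```python
def sensitivity_sweep(analysis, rel_perturbations):
    """Re-run `analysis` with the calibration coefficient scaled by (1 + p)."""
    return {p: analysis(1.0 + p) for p in rel_perturbations}


def is_fragile(baseline, perturbed, max_rel_change=0.5):
    """True if any perturbed estimate flips sign (or hits zero) or differs
    from the baseline by more than max_rel_change times the baseline."""
    for estimate in perturbed.values():
        if estimate * baseline <= 0:
            return True
        if abs(estimate - baseline) > max_rel_change * abs(baseline):
            return True
    return False


# Hypothetical stand-ins: each maps a coefficient scale to an effect size.
def steep_analysis(scale):
    return 0.4 - 8.0 * (scale - 1.0)  # effect depends strongly on calibration


def stable_analysis(scale):
    return 0.4 - 0.2 * (scale - 1.0)  # effect barely depends on calibration


if __name__ == "__main__":
    perturbations = [-0.10, -0.05, 0.05, 0.10]
    for analysis in (steep_analysis, stable_analysis):
        baseline = analysis(1.0)
        results = sensitivity_sweep(analysis, perturbations)
        print(analysis.__name__, "fragile:", is_fragile(baseline, results))
```

In practice `analysis` would wrap the real pipeline, so each perturbation costs a full run; a small number of perturbation levels is likely to be the practical limit.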

Platform-agnostic test suites that simulate known artifacts could help labs validate their preprocessing before running real data. For example, a test dataset with a known ground truth could be processed through the pipeline with and without the calibration file to ensure that the correction behaves as expected. The OpenNeuro platform and other data-sharing initiatives are beginning to include such test datasets, but adoption is slow.

The incident also underscores the value of independent auditing. The error was not caught by the original labs or by reviewers but by a third party who happened to compare checksums. Formal reproducibility audits, like those conducted by the Center for Open Science, are still rare in neuroscience. Making them routine could catch problems before they propagate through the literature.

None of these lessons guarantee that the next calibration error will be caught. The same structural forces that allowed this file to remain hidden for years — the informality of hardware configuration sharing, the lack of standards, the assumption that defaults are universal — are still in place. The question remains: will the field adopt systemic safeguards before the next hidden file causes another crisis?
