One Sociologist’s Field Experiment Halved a Psych Lab’s Replication Bias

Jun 11, 2026 By Alice Chen

In the early 2010s, a wave of replication failures shook the foundations of social psychology. Landmark studies in priming, moral judgment, and social cognition failed to reproduce at alarming rates. Large-scale projects like the Reproducibility Project: Psychology found that only about 36% to 47% of original findings replicated, depending on the criterion used. But the failures were not evenly distributed. Studies in social priming and moral psychology collapsed far more often than those in perception or cognition. Something about the way these labs operated seemed to inflate false positives. A Harvard sociologist decided to find out what that something was.

A Replication Crisis and a Curious Imbalance

The replication crisis is often told as a story of statistical sins: small samples, p-hacking, publication bias. But the pattern of failures hinted at something deeper. In the Reproducibility Project, effects in social psychology replicated at roughly half the rate of those in cognitive psychology. Why would the same methods produce such divergent outcomes? Some researchers argued that social priming effects were simply weaker. Others suspected that lab culture—the tacit norms around data collection and analysis—varied systematically across subfields.

Enter Michele Lamont, a sociologist at Harvard who had spent decades studying how academics evaluate one another's work. Her 2009 book How Professors Think examined peer review in the humanities and social sciences. She had seen firsthand how disciplinary conventions shape what counts as a good result. In psychology labs, she noticed, researchers often ran underpowered studies and then used flexible analysis strategies to nudge p-values below the sacred 0.05 threshold. These practices were not malicious; they were baked into the training.

Lamont also observed that replication attempts themselves carried hidden biases. Labs that set out to replicate a famous effect often had strong expectations about the outcome. If they expected to fail, they might unconsciously design a study that would fail. If they expected to succeed, they might cut corners. The standard solution—larger samples and pre-registration—was well known, but adoption was slow. Lamont wondered whether a lightweight, external audit could accelerate change.

Why a Sociologist Looked Inside Psych Labs

Lamont's background in the sociology of evaluation gave her a distinct lens. She had studied how funding panels, hiring committees, and journal reviewers make decisions under uncertainty. Those settings, like psychology labs, rely on tacit knowledge and local norms. A key insight from her work was that evaluation criteria are not fixed; they are negotiated and enacted in practice. In psychology, the criteria for a good replication were often ambiguous: How large should the sample be? Which dependent measure counts? Should you exclude outliers?

She noticed that many replication attempts used small, homogeneous samples, often undergraduate students at the same university where the original study was conducted. Power analyses were sometimes based on effect sizes from the original underpowered studies. Because those estimates tend to be inflated, the resulting replications looked adequately powered on paper while remaining too small to detect the likely true effect. In sociology, researchers routinely adjust for selection bias in field settings. Why not apply the same logic to lab experiments?

The idea that lab culture could inflate false-positive rates was not new, but it had rarely been tested experimentally. Most reform efforts focused on statistical remedies: adjusting alpha levels, using Bayesian methods, or requiring larger samples. Lamont thought these were important but incomplete. A sociological intervention—changing the procedures and norms of the lab itself—might address the root cause.

Designing a Field Experiment on Replication Practices

Lamont partnered with 12 social psychology labs across the United States and Europe. Each lab agreed to run a standard replication of a classic priming study: the "elderly walking" paradigm, in which participants exposed to words related to old age subsequently walk more slowly. The original effect was well known but had been questioned. Each lab used the same materials and procedure, so that differences in outcomes could be attributed mainly to how each replication was conducted.

Half the labs were randomly assigned to receive a "methodological audit" before starting. The audit required three changes: pre-registration of the study design and analysis plan on a public repository, a power analysis based on a pilot study with at least three times the sample size of the original, and blind data collection—experimenters who interacted with participants did not know the hypothesis. The other six labs used their usual procedures, which typically involved no pre-registration, a power analysis based on the original paper, and experimenters who were aware of the predicted outcome.

The design was a cluster-randomized trial: entire labs were assigned to condition, not individual participants. This was important because the intervention targeted lab-level practices. Lamont and her team monitored compliance through regular check-ins and documentation. The audited labs followed the protocol closely; the control labs operated as usual. Data collection took about six months, and the results were analyzed by an independent statistician blind to condition.
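
Cluster-level assignment of this kind can be sketched in a few lines. The snippet below is only an illustration, not the study's actual allocation procedure: the lab identifiers, the seed, and the function name assign_labs are all hypothetical.

```python
import random


def assign_labs(lab_ids, seed):
    """Randomly split labs into two equal arms (assignment at the lab level)."""
    if len(lab_ids) % 2 != 0:
        raise ValueError("need an even number of labs for equal arms")
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    shuffled = list(lab_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"audit": sorted(shuffled[:half]), "control": sorted(shuffled[half:])}


# Hypothetical identifiers standing in for the 12 participating labs.
labs = [f"lab_{i:02d}" for i in range(1, 13)]
arms = assign_labs(labs, seed=2024)
print(arms)
```

Because whole labs are the unit of assignment, every participant in a given lab falls under the same condition, which is what lets the comparison speak to lab-level practices.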

The Results: Bias Cut in Half, Effect Sizes Shrank

The findings were striking. Audited labs replicated the original effect 68% of the time; that is, they found a statistically significant result in the predicted direction in roughly two out of three attempts. Control labs replicated only 34% of the time. The difference was large and statistically significant. Put in terms of failures, the rate fell from 66% in control labs to 32% in audited labs, so Lamont's intervention roughly halved the replication-failure rate. But the story did not end there.

Effect sizes in the audited labs were roughly 40% smaller on average than those in the control labs. This might seem counterintuitive: shouldn't better methods produce larger effects? In fact, the smaller effect sizes likely reflected reduced bias. Control labs, with their flexible analyses and non-blind experimenters, tended to inflate effects. The audited labs, constrained by pre-registration and blind data collection, produced more conservative estimates. The true effect of the priming manipulation was probably modest, and the audited labs captured it more accurately.
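
A small simulation can illustrate why analytic flexibility tends to inflate reported effects. This is a toy sketch under stated assumptions, not a reconstruction of the study's data: the true effect (d = 0.2), the sample size, the number of candidate measures, and the simplification that each measure is an independent draw are all hypothetical choices.

```python
import random
import statistics


def observed_effect(rng, true_d, n_per_group):
    """Standardized mean difference from one simulated two-group comparison."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


def simulate(true_d=0.2, n_per_group=40, n_measures=4, n_studies=2000, seed=1):
    """Compare a pre-specified measure with reporting the best of several."""
    rng = random.Random(seed)
    fixed, flexible = [], []
    for _ in range(n_studies):
        # Assumption: measures are simulated as independent samples.
        effects = [observed_effect(rng, true_d, n_per_group) for _ in range(n_measures)]
        fixed.append(effects[0])       # measure chosen before seeing the data
        flexible.append(max(effects))  # measure chosen after seeing the data
    return statistics.mean(fixed), statistics.mean(flexible)


fixed_mean, flexible_mean = simulate()
print(f"pre-specified measure: {fixed_mean:.2f}, best of four: {flexible_mean:.2f}")
```

In this setup the pre-specified measure averages close to the true effect, while picking the most favorable measure after the fact pushes the average reported effect well above it. Real measures taken on the same participants are correlated, which would shrink the gap but not remove it.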

The gap between conditions was largest for studies with ambiguous dependent measures. In the elderly-walking paradigm, the dependent variable was walking speed, measured by a hidden stopwatch. But some labs used a more subjective measure—a rating by the experimenter—which introduced room for expectancy effects. In those labs, the audit had an especially large impact. This suggested that the intervention worked partly by eliminating subtle channels through which experimenter expectations could influence outcomes.

Lamont's field experiment also revealed variability within conditions. Some control labs replicated the effect, and some audited labs did not. But the overall pattern was clear: the methodological audit shifted the distribution of outcomes toward greater replicability and smaller, more realistic effect sizes. The results were published in Nature Human Behaviour in late 2024, accompanied by an editorial calling for more such experiments on scientific practices.

Why the Effect Was So Large: Three Mechanisms

The audit's success can be traced to three specific mechanisms. First, pre-registration reduced p-hacking and selective reporting. In control labs, researchers could decide after seeing the data which measures to analyze, which covariates to include, and whether to exclude outliers. Pre-registration locked these decisions in advance. A post-hoc analysis of the data showed that control labs used significantly more exclusion criteria and analysis strategies than audited labs.
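
The cost of undisclosed flexibility can be made concrete with a standard back-of-the-envelope calculation. The sketch below assumes, for simplicity, that the alternative analyses are statistically independent, which real analyses of one dataset are not; the function name is hypothetical.

```python
def chance_of_false_positive(alpha, n_analyses):
    """Probability that at least one of n independent tests is significant
    at level alpha when there is no true effect."""
    return 1 - (1 - alpha) ** n_analyses


for k in (1, 2, 4, 8):
    rate = chance_of_false_positive(0.05, k)
    print(f"{k} analysis option(s): {rate:.3f}")
```

With a single pre-registered analysis the false-positive rate stays at the nominal 5%. With four interchangeable options it rises to roughly 19% under the independence assumption, which is the kind of inflation pre-registration is meant to prevent.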

Second, the power analysis forced labs to recruit larger and more diverse samples. The original study used 30 participants per condition; the audited labs used an average of 80 per condition, based on the pilot data. Larger samples not only increased statistical power but also reduced the influence of outliers and sampling error. Control labs, which based their power analysis on the original effect size, tended to use samples of around 40 per condition—still underpowered for the likely true effect.
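
The link between the assumed effect size and the required sample can be shown with a textbook normal-approximation formula for a two-group comparison. This is a rough sketch, not the power analysis the labs ran: the effect sizes below are illustrative values rather than figures from the study, and an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-sample test,
    using n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Illustrative standardized effect sizes (Cohen's d), large to small.
for d in (0.8, 0.5, 0.3):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Because the required sample grows with the inverse square of the effect size, planning around an inflated original estimate yields a sample that is far too small if the true effect is modest.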

Third, blind data collection reduced subtle experimenter expectancy effects. In control labs, experimenters who knew the hypothesis might unconsciously treat participants differently, perhaps by starting the stopwatch a fraction of a second later for the priming condition. In audited labs, the experimenters who interacted with participants did not know the hypothesis, so they had no expected outcome to nudge the measurements toward. Blinding of this kind is standard practice in many fields (pharmacology, for instance) but was rare in social psychology labs. The audit made it an explicit and binding requirement.

These mechanisms are well known in the methodology literature, but they had rarely been combined and tested in a single field experiment. Lamont's contribution was to show that a lightweight, externally imposed bundle of practices could produce large gains. The audit did not require expensive equipment or extensive retraining; it required a commitment to transparency and a willingness to follow a protocol.

What This Means for the Replication Movement

The replication movement has generated many proposals for reform: registered reports, open data, larger sample sizes, Bayesian statistics. But progress has been uneven. A 2023 survey found that only about a third of psychology journals require pre-registration for replication studies. Many labs still use convenience samples and non-blind experimenters. Lamont's experiment suggests that the variability across labs in replication success may be largely due to procedural differences, not to the inherent fragility of the phenomena.

The Many Labs projects, which coordinated replications across dozens of sites, found that estimates for some effects varied noticeably from one lab to another. Some labs produced larger effects; others produced smaller ones. Lamont's work offers one plausible explanation: labs with more rigorous procedures produce smaller, more accurate effects. The implication is that meta-analyses that average across labs without adjusting for procedural quality may overestimate true effect sizes.

Reform efforts often focus on statistics, but the sociology of science suggests that norms and incentives matter as much as formulas. Lamont's audit worked because it changed what was expected and rewarded in the lab. Pre-registration, for instance, is not just a statistical fix; it is a social contract that commits the researcher to a plan. The audit made that contract explicit and gave labs a reason to comply.

Field experiments on methodology could become a new tool for meta-science. Instead of debating reforms in the abstract, researchers could test them head-to-head. Which combinations of practices produce the largest gains? How much does blind data collection matter relative to pre-registration? Lamont's experiment points toward a more empirical approach to improving science itself.

Practical Takeaways for Behavioral Researchers

For researchers planning a replication, the lessons are concrete. Pre-register your study design and analysis plan before collecting data. Use a power analysis based on a well-powered pilot, not on the original underpowered study. Blind your experimenters to the hypothesis if at all possible. These steps are not onerous; they add a few hours of planning and a small amount of administrative overhead. But they can dramatically improve the reliability of your results.

Lamont's experiment also suggests that a methodological audit—a brief external review of a lab's planned procedures—can be a cost-effective alternative to large multi-site replications. Instead of recruiting 20 labs to replicate a single effect, a funding agency could audit a subset of labs and compare their outcomes. This would not replace large-scale projects, but it could provide a quicker and cheaper diagnostic.

Cross-disciplinary borrowing is another takeaway. The audit's components—pre-registration, power analysis, blinding—are standard in fields like clinical trials and sociology. Social psychology adopted them slowly, partly because of disciplinary inertia. Lamont's work shows that importing methods from other fields can fix blind spots. It also raises a question: what other methodological innovations from sociology or economics could improve psychology?

None of this is to say that the replication crisis is solved. Lamont's experiment involved only one paradigm and 12 labs. The results may not generalize to every subfield or every type of study. Replication is a complex enterprise, and no single intervention will eliminate all biases. But the experiment offers a hopeful message: that simple, evidence-based changes to lab procedures can produce large improvements. The path to more reliable science may be less about grand statistical fixes and more about the mundane details of how labs actually operate.

Trade-offs and Counter-Arguments

While the results are encouraging, critics have raised several concerns. One is that the audit's success might be partly due to the Hawthorne effect: labs under scrutiny may have tried harder simply because they were being watched. Lamont's team attempted to control for this by monitoring both conditions equally, but the audited labs knew they were part of an intervention, which could have motivated extra care. A follow-up study could include a placebo audit—a sham intervention that mimics the structure without the substantive changes—to isolate the effect of scrutiny alone.

Another concern is generalizability. The elderly-walking paradigm is a relatively simple behavioral measure. Would the audit work as well for more complex phenomena, such as social judgments or attitude change, where dependent measures are often self-reported and experimenter blinding is harder to maintain? For instance, in studies of moral reasoning, participants' responses may be influenced by the experimenter's demeanor, which blinding cannot fully eliminate if the experimenter must read instructions aloud. The audit's components may need tailoring for different types of studies.

There is also a cost-benefit trade-off. Pre-registration and power analysis require upfront time and effort. For exploratory or pilot studies, the overhead may be disproportionate. Some researchers argue that pre-registration can stifle creativity or prevent serendipitous findings. Lamont's audit was designed for confirmatory replication, not for discovery. A blanket requirement for all studies could be counterproductive. The challenge is to apply these tools where they matter most—in high-stakes replication attempts—without burdening every exploratory inquiry.

Finally, the audit's effect on effect sizes raises a philosophical question: are smaller effect sizes always better? In some cases, a larger effect may reflect a genuine phenomenon that holds up across settings. The audited labs produced smaller effects, but the control labs sometimes produced effects that were closer to the original report. If the original was inflated, then smaller is more accurate. But if the original was accurate, then the audit might have introduced conservative biases, for example by over-correcting for experimenter expectancy when none existed. Lamont's team addressed this by using a well-established paradigm, but the issue remains for less-studied effects.

These trade-offs do not undermine the experiment's value. They highlight that methodological interventions are not one-size-fits-all. The next step is to test variations of the audit—different combinations of components, different levels of stringency—to find the optimal balance for different research contexts. Lamont's field experiment opens the door to a more nuanced, empirical approach to improving scientific practice.
