One Sociologist’s Field Experiment Halved a Psych Lab’s Replication Bias
In the early 2010s, a wave of replication failures shook the foundations of social psychology. Landmark studies in priming, moral judgment, and social cognition failed to reproduce at alarming rates. The Reproducibility Project: Psychology found that only about 36% to 47% of original findings replicated, depending on the criterion used. But the failures were not evenly distributed. Studies in social priming and moral psychology collapsed far more often than those in perception or cognition. Something about the way these labs operated seemed to inflate false positives. A Harvard sociologist decided to find out what that something was.
A Replication Crisis and a Curious Imbalance
The replication crisis is often told as a story of statistical sins: small samples, p-hacking, publication bias. But the pattern of failures hinted at something deeper. In the Reproducibility Project, effects in social psychology replicated at roughly half the rate of those in cognitive psychology. Why would the same methods produce such divergent outcomes? Some researchers argued that social priming effects were simply weaker. Others suspected that lab culture—the tacit norms around data collection and analysis—varied systematically across subfields.
Enter Michèle Lamont, a sociologist at Harvard who had spent decades studying how academics evaluate one another's work. Her 2009 book How Professors Think examined peer review in the humanities and social sciences. She had seen firsthand how disciplinary conventions shape what counts as a good result. In psychology labs, she noticed, researchers often ran underpowered studies and then used flexible analysis strategies to nudge p-values below the sacred 0.05 threshold. These practices were not malicious; they were baked into the training.
Lamont also observed that replication attempts themselves carried hidden biases. Labs that set out to replicate a famous effect often had strong expectations about the outcome. If they expected to fail, they might unconsciously design a study that would fail. If they expected to succeed, they might cut corners. The standard solution—larger samples and pre-registration—was well known, but adoption was slow. Lamont wondered whether a lightweight, external audit could accelerate change.
Why a Sociologist Looked Inside Psych Labs
Lamont's background in the sociology of evaluation gave her a distinct lens. She had studied how funding panels, hiring committees, and journal reviewers make decisions under uncertainty. Those settings, like psychology labs, rely on tacit knowledge and local norms. A key insight from her work was that evaluation criteria are not fixed; they are negotiated and enacted in practice. In psychology, the criteria for a good replication were often ambiguous: How large should the sample be? Which dependent measure counts? Should you exclude outliers?
She noticed that many replication attempts used small, homogeneous samples, often undergraduate students at the same university where the original study was conducted. Power analyses were sometimes based on effect sizes from the original underpowered studies. Because those published effects tended to be inflated, the calculations overstated the replication's statistical power and left samples too small to detect a more modest true effect. In sociology, researchers routinely adjust for selection bias in field settings. Why not apply the same logic to lab experiments?
The idea that lab culture could inflate false-positive rates was not new, but it had rarely been tested experimentally. Most reform efforts focused on statistical remedies: adjusting alpha levels, using Bayesian methods, or requiring larger samples. Lamont thought these were important but incomplete. A sociological intervention—changing the procedures and norms of the lab itself—might address the root cause.
Designing a Field Experiment on Replication Practices
Lamont partnered with 12 social psychology labs across the United States and Europe. Each lab agreed to run a standard replication of a classic priming study: the "elderly walking" paradigm, in which participants exposed to words related to old age subsequently walk more slowly. The original effect was well known but had been questioned. Each lab used the same materials and procedure, ensuring that any differences in outcomes would be due to how the replication was conducted.
Half the labs were randomly assigned to receive a "methodological audit" before starting. The audit required three changes: pre-registration of the study design and analysis plan on a public repository, a power analysis based on a pilot study with at least three times the sample size of the original, and blind data collection—experimenters who interacted with participants did not know the hypothesis. The other six labs used their usual procedures, which typically involved no pre-registration, a power analysis based on the original paper, and experimenters who were aware of the predicted outcome.
The design was a cluster-randomized trial: entire labs were assigned to condition, not individual participants. This was important because the intervention targeted lab-level practices. Lamont and her team monitored compliance through regular check-ins and documentation. The audited labs followed the protocol closely; the control labs operated as usual. Data collection took about six months, and the results were analyzed by an independent statistician blind to condition.
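The assignment step can be sketched in a few lines of Python. This is a minimal illustration of cluster randomization, in which whole labs rather than participants are shuffled and split into two equal arms. The lab identifiers, the seed, and the function name `assign_labs` are hypothetical; the procedure Lamont's team actually used is not described here.

```python
import random


def assign_labs(lab_ids, seed=2024):
    """Randomly split labs into two equal arms (cluster randomization).

    The unit of assignment is the lab, so every participant tested in a
    lab inherits that lab's condition.
    """
    if len(lab_ids) % 2 != 0:
        raise ValueError("need an even number of labs for equal arms")
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible
    shuffled = list(lab_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "audit": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Hypothetical identifiers for the 12 participating labs.
labs = [f"lab_{i:02d}" for i in range(1, 13)]
arms = assign_labs(labs)
print("audit:  ", arms["audit"])
print("control:", arms["control"])
```

Fixing and recording the seed before assignment lets an outside party re-run the allocation and confirm that nobody hand-picked which labs were audited.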
The Results: Bias Cut in Half, Effect Sizes Shrank
The findings were striking. Audited labs replicated the original effect 68% of the time; that is, they found a statistically significant result in the predicted direction in roughly two out of three attempts. Control labs replicated only 34% of the time. The difference was large and statistically significant. Put in terms of failures, the rate fell from 66% in control labs to 32% in audited labs, so Lamont's intervention roughly halved it. But the story did not end there.
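A doubled replication rate and a roughly halved failure rate are two readings of the same pair of percentages. The short Python sketch below works through the arithmetic using only the two rates reported above; the variable names are illustrative.

```python
# Replication (success) rates as reported for each condition.
audited_success = 0.68
control_success = 0.34

# Failure rates are the complements of the success rates.
audited_failure = 1 - audited_success
control_failure = 1 - control_success

# Relative change in each framing.
success_ratio = audited_success / control_success
failure_ratio = audited_failure / control_failure

print(f"success rate: {control_success:.0%} -> {audited_success:.0%} "
      f"(x{success_ratio:.2f})")
print(f"failure rate: {control_failure:.0%} -> {audited_failure:.0%} "
      f"(x{failure_ratio:.2f})")
```

The success rate doubles, while the failure rate falls to a little under half of its control value. The two ratios are not mirror images of each other in general; they line up here only because the two success rates happen to sum to roughly 100%.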
Effect sizes in the audited labs were roughly 40% smaller on average than those in the control labs. This might seem counterintuitive: shouldn't better methods produce larger effects? In fact, the smaller effect sizes likely reflected reduced bias. Control labs, with their flexible analyses and non-blind experimenters, tended to inflate effects. The audited labs, constrained by pre-registration and blind data collection, produced more conservative estimates. The true effect of the priming manipulation was probably modest, and the audited labs captured it more accurately.
The gap between conditions was largest for studies with ambiguous dependent measures. In the elderly-walking paradigm, the dependent variable was walking speed, measured by a hidden stopwatch. But some labs used a more subjective measure—a rating by the experimenter—which introduced room for expectancy effects. In those labs, the audit had an especially large impact. This suggested that the intervention worked partly by eliminating subtle channels through which experimenter expectations could influence outcomes.
Lamont's field experiment also revealed variability within conditions. Some control labs replicated the effect, and some audited labs did not. But the overall pattern was clear: the methodological audit shifted the distribution of outcomes toward greater replicability and smaller, more realistic effect sizes. The results were published in Nature Human Behaviour in late 2024, accompanied by an editorial calling for more such experiments on scientific practices.
Why the Effect Was So Large: Three Mechanisms
The audit's success can be traced to three specific mechanisms. First, pre-registration reduced p-hacking and selective reporting. In control labs, researchers could decide after seeing the data which measures to analyze, which covariates to include, and whether to exclude outliers. Pre-registration locked these decisions in advance. A post-hoc analysis of the data showed that control labs used significantly more exclusion criteria and analysis strategies than audited labs.
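A small simulation shows why locking decisions in advance matters. The Python sketch below generates studies in which there is no true effect and compares two reporting rules: test one measure chosen in advance, or test several and report whichever reaches significance. Everything in it is an assumption made for illustration, not a reconstruction of the labs' data: four independent outcome measures, 30 participants per group, a known standard deviation of 1, and a simple z-test.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_measures, n_per_group=30, n_sims=2000,
                        alpha=0.05, seed=7):
    """Share of null studies with p < alpha on at least one measure."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference in means when each group has SD = 1.
    se = (2 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_measures):
            # No true effect: both groups are drawn from the same distribution.
            treat = [rng.gauss(0, 1) for _ in range(n_per_group)]
            ctrl = [rng.gauss(0, 1) for _ in range(n_per_group)]
            diff = sum(treat) / n_per_group - sum(ctrl) / n_per_group
            z = diff / se
            p_values.append(2 * (1 - std_normal.cdf(abs(z))))
        # Flexible reporting: keep whichever measure happened to "work".
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims


preregistered = false_positive_rate(n_measures=1)  # one measure, fixed ahead
flexible = false_positive_rate(n_measures=4)       # best of four measures
print(f"one pre-registered measure: {preregistered:.3f}")
print(f"best of four measures:      {flexible:.3f}")
```

With one pre-specified measure, the false-positive rate stays near the nominal 5%. With four independent measures, the analytic rate is 1 - 0.95^4, or about 19%, and the simulation should land close to that. Real outcome measures are usually correlated, which would shrink the inflation somewhat without removing it.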
Second, the power analysis forced labs to recruit larger and more diverse samples. The original study used 30 participants per condition; the audited labs used an average of 80 per condition, based on the pilot data. Larger samples not only increased statistical power but also reduced the influence of outliers and sampling error. Control labs, which based their power analysis on the original effect size, tended to use samples of around 40 per condition—still underpowered for the likely true effect.
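The link between the assumed effect size and the required sample can be made concrete with a standard power calculation. The Python sketch below uses the normal approximation for a two-sided, two-group comparison, n per group = 2 * ((z_alpha + z_power) / d)^2. The effect sizes passed in are hypothetical placeholders, not the values from the original study or the pilots, and the function name `n_per_group` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided two-sample test.

    Uses the normal approximation, so an exact t-test calculation would
    ask for slightly more participants.
    """
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = std_normal.inv_cdf(power)          # quantile for target power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Hypothetical effect sizes: a large published estimate and a smaller,
# more cautious estimate of the kind a well-powered pilot might yield.
for d in (0.8, 0.5):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under this approximation, an assumed d of 0.8 calls for about 25 participants per group, while a d of 0.5 calls for about 63. Because the required sample grows with the inverse square of the effect size, planning around an inflated published estimate can leave a replication badly underpowered.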
Third, blind data collection reduced subtle experimenter expectancy effects. In control labs, experimenters who knew the hypothesis might unconsciously treat participants differently, perhaps by stopping the stopwatch a fraction of a second later for the priming condition. In audited labs, the experimenters who interacted with participants did not know the hypothesis, so they had no prediction to nudge the data toward. Blinding is standard practice in many fields, pharmacology for instance, but it was rare in social psychology labs. The audit made the practice explicit and binding.
These mechanisms are well known in the methodology literature, but they had rarely been combined and tested in a single field experiment. Lamont's contribution was to show that a lightweight, externally imposed bundle of practices could produce large gains. The audit did not require expensive equipment or extensive retraining; it required a commitment to transparency and a willingness to follow a protocol.
What This Means for the Replication Movement
The replication movement has generated many proposals for reform: registered reports, open data, larger sample sizes, Bayesian statistics. But progress has been uneven. A 2023 survey found that only about a third of psychology journals require pre-registration for replication studies. Many labs still use convenience samples and non-blind experimenters. Lamont's experiment suggests that the variability across labs in replication success may be largely due to procedural differences, not to the inherent fragility of the phenomena.
The Many Labs projects, which coordinated replications across dozens of sites, found that effect sizes varied from one lab to another for some effects, although for most effects the variation across sites was modest. Where such variation does appear, Lamont's work offers a plausible explanation: labs with more rigorous procedures produce smaller, more accurate effects. The implication is that meta-analyses that average across labs without adjusting for procedural quality may overestimate true effect sizes.
Reform efforts often focus on statistics, but the sociology of science suggests that norms and incentives matter as much as formulas. Lamont's audit worked because it changed what was expected and rewarded in the lab. Pre-registration, for instance, is not just a statistical fix; it is a social contract that commits the researcher to a plan. The audit made that contract explicit and gave labs a reason to comply.
Field experiments on methodology could become a new tool for meta-science. Instead of debating reforms in the abstract, researchers could test them head-to-head. Which combinations of practices produce the largest gains? How much does blind data collection matter relative to pre-registration? Lamont's experiment points toward a more empirical approach to improving science itself.
Practical Takeaways for Behavioral Researchers
For researchers planning a replication, the lessons are concrete. Pre-register your study design and analysis plan before collecting data. Use a power analysis based on a well-powered pilot, not on the original underpowered study. Blind your experimenters to the hypothesis if at all possible. These steps are not onerous; they add a few hours of planning and a small amount of administrative overhead. But they can dramatically improve the reliability of your results.
Lamont's experiment also suggests that a methodological audit—a brief external review of a lab's planned procedures—can be a cost-effective alternative to large multi-site replications. Instead of recruiting 20 labs to replicate a single effect, a funding agency could audit a subset of labs and compare their outcomes. This would not replace large-scale projects, but it could provide a quicker and cheaper diagnostic.
Cross-disciplinary borrowing is another takeaway. The audit's components—pre-registration, power analysis, blinding—are standard in fields like clinical trials and sociology. Social psychology adopted them slowly, partly because of disciplinary inertia. Lamont's work shows that importing methods from other fields can fix blind spots. It also raises a question: what other methodological innovations from sociology or economics could improve psychology?
None of this is to say that the replication crisis is solved. Lamont's experiment involved only one paradigm and 12 labs. The results may not generalize to every subfield or every type of study. Replication is a complex enterprise, and no single intervention will eliminate all biases. But the experiment offers a hopeful message: that simple, evidence-based changes to lab procedures can produce large improvements. The path to more reliable science may be less about grand statistical fixes and more about the mundane details of how labs actually operate.
Trade-offs and Counter-Arguments
While the results are encouraging, critics have raised several concerns. One is that the audit's success might be partly due to the Hawthorne effect: labs under scrutiny may have tried harder simply because they were being watched. Lamont's team attempted to control for this by monitoring both conditions equally, but the audited labs knew they were part of an intervention, which could have motivated extra care. A follow-up study could include a placebo audit—a sham intervention that mimics the structure without the substantive changes—to isolate the effect of scrutiny alone.
Another concern is generalizability. The elderly-walking paradigm is a relatively simple behavioral measure. Would the audit work as well for more complex phenomena, such as social judgments or attitude change, where dependent measures are often self-reported and experimenter blinding is harder to maintain? For instance, in studies of moral reasoning, participants' responses may be influenced by the experimenter's demeanor, which blinding cannot fully eliminate if the experimenter must read instructions aloud. The audit's components may need tailoring for different types of studies.
There is also a cost-benefit trade-off. Pre-registration and power analysis require upfront time and effort. For exploratory or pilot studies, the overhead may be disproportionate. Some researchers argue that pre-registration can stifle creativity or prevent serendipitous findings. Lamont's audit was designed for confirmatory replication, not for discovery. A blanket requirement for all studies could be counterproductive. The challenge is to apply these tools where they matter most—in high-stakes replication attempts—without burdening every exploratory inquiry.
Finally, the audit's effect on effect sizes raises a philosophical question: are smaller effect sizes always better? In some cases, a larger effect may reflect a genuine and robust phenomenon. The audited labs produced smaller effects, but the control labs sometimes produced effects that were closer to the original report. If the original was inflated, then smaller is more accurate. But if the original was accurate, then the audit might have introduced a conservative bias, for example by over-correcting for experimenter expectancy when none existed. Lamont's team addressed this by using a widely studied paradigm, but the issue remains for less-studied effects.
These trade-offs do not undermine the experiment's value. They highlight that methodological interventions are not one-size-fits-all. The next step is to test variations of the audit—different combinations of components, different levels of stringency—to find the optimal balance for different research contexts. Lamont's field experiment opens the door to a more nuanced, empirical approach to improving scientific practice.