Rethinking Statistics and Causality: Why Mechanisms Cannot Be Inferred from Data Distributions

National Taiwan University

TL;DR: Persistent failures of statistical and causal inference to match real-world behavior are not merely methodological, but structural: observed data distributions are projections that do not preserve the properties of the underlying system. As a result, mechanisms are fundamentally non-identifiable from these distributions. Statistical inference reduces to geometric alignment in projected space, and causal inference extends the same mistake by treating probabilistic structure as causal structure.

[Figures: mismatch examples from psychology, economics, and biomedicine]

Projection: Data Does Not Preserve Structure
Like a shadow of a 3D object—you can’t tell the shape from the shadow alone.

Geometry: Correlation Is Just an Angle
Two arrows can form the same angle—even if they come from completely different systems.

Probability: Structure Imposed by Mathematics
Just because numbers multiply doesn’t mean events do.

Why Statistics & Causal Inference Need Rethinking

Across psychology, neuroscience, and applied economics, many findings fail to replicate or disappear outside the lab. Even without fraud, statistically “significant” effects often fail to match real-world behavior. These breakdowns are not discipline-specific—they reflect a deeper mismatch between how real systems generate phenomena and how statistical inference interprets the data left behind.

In practice, statistics and causal inference operate entirely within data space. But data are projections of a much higher-dimensional system, and such projections do not preserve the original structure. Once mapped into data space, the properties of the underlying system are no longer retained, and the mechanism is not present in the observations.

Many failures stem from two assumptions: that alignment in data space reflects structure in the underlying system, and that probabilistic factorizations reveal the process that generated the outcomes. Yet these tools operate in a domain where the mechanism is no longer present, which is why they break even when used correctly.

The Two Roots of the Error

Two foundational mistakes that pushed statistics and causal inference off course.

🧭

The Underlying Mechanisms Were Unknown

Many fields did not know how their underlying systems worked. Without a clear understanding of the system, inference defaulted to geometry and statistics—treating data-space patterns as if they revealed real structure, even though they do not.

No mechanism → no grounding

📉

Minimal Data, Big Conclusions

Many fields relied on minimal data. With only small, sparse convenience samples, inference collapsed into geometric patterns—treating limited observations as if they captured the structure of the world, even though they do not.

Sparse data ≠ world structure
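A minimal sketch of this failure mode (Python with NumPy; the sample sizes are illustrative): among a few hundred pure-noise variables observed only ten times each, strong pairwise correlations appear by chance alone.

```python
# Sketch: with sparse samples, independent variables routinely look "aligned".
# Hypothetical setup: 200 variables, 10 observations each, and by construction
# no real relationship anywhere.
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_obs = 200, 10
X = rng.standard_normal((n_vars, n_obs))  # pure noise: no structure exists

r = np.corrcoef(X)        # all pairwise correlations between the variables
np.fill_diagonal(r, 0.0)  # ignore each variable's correlation with itself
print(f"max |r| among independent variables: {np.abs(r).max():.2f}")
# Typically prints a value above 0.8: a "strong" data-space pattern produced
# by a system that, by construction, contains no structure at all.
```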

Three Structural Mismatches

Projection, geometry, and probability fail for the same reason: they operate on representations that do not preserve the structure of the system.

🧭

Projection Does Not Preserve Structure

Observations are projections of a higher-dimensional system. Projection does not preserve the original properties, so the structure of the system is not retained in data space. The mechanism is not in the observations.

A shadow is not the object
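A minimal sketch (Python with NumPy; the point sets are hypothetical): two different 3D systems whose 2D projections are identical, so nothing computed from the projected data can distinguish them.

```python
# Sketch: projection is non-injective, so distinct systems can leave
# identical data behind.
import numpy as np

# Two different underlying "systems": 3D point configurations that differ
# only in the unobserved z-coordinate.
system_a = np.array([[1.0, 2.0, 0.0],
                     [3.0, 1.0, 5.0]])
system_b = np.array([[1.0, 2.0, 9.0],
                     [3.0, 1.0, -4.0]])

def observe(system):
    """The measurement: keep x and y, discard z (the 'shadow')."""
    return system[:, :2]

print(np.array_equal(observe(system_a), observe(system_b)))  # True
# Identical observations, different systems: no method restricted to the
# projected data can recover which system produced it.
```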

📐

Correlation Is Just Geometry

Correlation is the cosine of the angle between mean-centered vectors—pure geometry. This alignment exists only within a chosen data space, and does not reflect the structure of the underlying system. Unrelated variables can appear aligned, while real relations can disappear.

Angle is not relationship
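A minimal sketch (Python with NumPy; the data are illustrative noise) confirming that Pearson correlation and the cosine of the angle between mean-centered vectors are the same number:

```python
# Sketch: Pearson correlation is exactly a cosine, an angle in data space.
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.standard_normal(50), rng.standard_normal(50)

xc, yc = x - x.mean(), y - y.mean()  # mean-center both vectors
cosine = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
pearson = np.corrcoef(x, y)[0, 1]

print(np.isclose(cosine, pearson))  # True: same quantity, two names
# The number measures alignment between two vectors in a chosen data space;
# it says nothing about what produced them.
```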

Probability Imposes Structure

Probabilistic operations—multiplication, division, conditioning—impose algebraic structure on data. These operations do not necessarily correspond to real processes in the world. There is no clear mapping from these operations to actual mechanisms.

No mapping to real-world events
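A minimal sketch (Python with NumPy; the joint table is a made-up example): the chain rule factors the same joint distribution in both directions, so the algebra alone cannot say which factorization, if either, mirrors a real process.

```python
# Sketch: factorization is symmetric algebra, not a record of mechanism.
import numpy as np

# Hypothetical joint over two binary events A (rows) and B (columns).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_a, p_b = joint.sum(axis=1), joint.sum(axis=0)  # marginals
p_b_given_a = joint / p_a[:, None]               # P(B|A)
p_a_given_b = joint / p_b[None, :]               # P(A|B)

# Both factorizations reproduce the joint exactly:
print(np.allclose(p_a[:, None] * p_b_given_a, joint))  # True
print(np.allclose(p_b[None, :] * p_a_given_b, joint))  # True
# Nothing in the numbers says whether A produces B, B produces A, or
# neither: the multiplication is arithmetic, not mechanism.
```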

Abstract

Statistical and causal inference have become universal currencies of explanation across the sciences, particularly in domains where underlying mechanisms remain opaque. Their apparent rigor—spanning psychology, economics, and biomedicine—rests on the assumption that patterns within data can reveal the processes that generate them. Yet persistent mismatches between empirical predictions and real-world behavior expose a deeper limitation: mechanisms cannot be inferred from data distributions alone.

To address this limitation, we revisit the foundations of both paradigms, showing how statistical inference reduces explanation to geometric alignment, while causal inference, which evolved from Bayes’ theorem and graphical models, extends this misstep by conflating probabilistic structure with causal truth. Both expose the same epistemic gap: data encode a lower-dimensional projection of structure, not the mechanism that generates it.

We argue that understanding the world follows two routes: one is data-driven, expanding models toward richer function classes to achieve high-precision prediction, as exemplified by modern deep learning; the other is mechanism-driven, proposing and testing structural hypotheses as in the physical sciences. A robust framework requires both: data-driven models for high-precision prediction, and mechanistic models for reconstructing how the world produces the data we observe.

Why It Matters

These failures do not stay inside statistics. They propagate across every discipline that relies on data-space reasoning. Psychology, neuroscience, political science, and applied economics routinely draw conclusions from patterns that could never reveal how the system actually works. Entire literatures grow around correlations, regressions, and conditional probabilities whose quantities have no mechanism-level meaning.

Causal inference amplifies the problem. Its algebra multiplies probabilities into joint events that do not exist in real systems. Because observed variables are low-dimensional projections of a richer generative world, the true joint distribution is undefined—yet causal formulas fabricate it through factorization. The machinery operates on objects that have no semantic or generative counterpart in reality.
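A minimal sketch (Python with NumPy; the linear-Gaussian models are illustrative): two opposite causal structures (X causes Y, and Y causes X) induce the same observed joint distribution, so no factorization of that distribution can recover the direction.

```python
# Sketch: opposite mechanisms, identical observed distribution.
import numpy as np

rng = np.random.default_rng(2)
n, b = 200_000, 0.6

# Model 1: X -> Y
x1 = rng.standard_normal(n)
y1 = b * x1 + np.sqrt(1 - b**2) * rng.standard_normal(n)

# Model 2: Y -> X (coefficients chosen to match the same covariance)
y2 = rng.standard_normal(n)
x2 = b * y2 + np.sqrt(1 - b**2) * rng.standard_normal(n)

print(np.cov(x1, y1).round(2))  # ~[[1.0, 0.6], [0.6, 1.0]]
print(np.cov(x2, y2).round(2))  # ~[[1.0, 0.6], [0.6, 1.0]]
# Same joint distribution from opposite mechanisms: any formula that takes
# only the observed distribution as input cannot tell them apart.
```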

Machine learning and statistical learning theory inherit the same statistical ontology, but their actual behavior no longer matches their notation. Quantities like likelihoods, expected risk, factorizations, and losses survive as syntactic artifacts, even though the true system has no mechanism that corresponds to them. Models work through optimization heuristics and implementation patches, not through the semantics their formulas suggest. What runs in practice is not what the notation claims.

That is why the foundation must be rebuilt. As long as inference is performed in data space, every downstream field will continue scaling the same structural error—operating on quantities that do not correspond to the systems they aim to explain.

Cite This Work

@misc{diau_2025_17633314,
  author       = {Diau, Egil},
  title        = {Rethinking Statistics and Causality: Why
                   Mechanisms Cannot Be Inferred from Data
                   Distributions
                  },
  month        = nov,
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17633314},
  url          = {https://doi.org/10.5281/zenodo.17633314},
}