image: The Multisensory Correlation Detector (MCD) population model consists of elementary computational units (left), each of which responds to audiovisual transients (that is, changes in the input) that correlate over time and space. A population of such units (right) can process real-life audiovisual stimuli and simulate multisensory perception.
Credit: Parise, eLife 2025 (CC BY 4.0)
A neural computation first discovered in insects has been shown to explain how humans combine sight and sound – even when illusions trick us into “hearing” what we do not see. Now, Dr Cesare Parise, a researcher at the University of Liverpool, UK, has created a biologically grounded model based on this computation, one that can take in real-life audiovisual information rather than the more abstract parameters used by previous models.
Parise’s research, published today in eLife as the final Version of Record after appearing previously as a Reviewed Preprint, is described by the editors as an important study with compelling evidence.
When we watch someone speak, our brains seamlessly weave together what we see and what we hear. This process is so automatic that mismatches can trick us. For example, in the famous McGurk illusion, dubbed sounds and lip movements produce an entirely new percept, while in the ventriloquist illusion, voices appear to come from the dummy’s mouth rather than the performer’s. But how do our brains know when a voice matches moving lips, or whether footsteps are in sync with the sound they produce?
Building models of multisensory integration that could explain how the brain combines information across vision and hearing has been a challenge for decades. While current models are mathematically powerful, they are biologically abstract and, instead of directly processing audiovisual signals, they rely on parameters defined by the experimenters.
Parise, who is a Senior Lecturer in Psychology at the University of Liverpool, UK, explains: “Despite decades of research in audiovisual perception, we still did not have a model that could solve a task as simple as taking a video as input and telling whether the audio would be perceived as in sync. This limitation reveals a deeper issue: without being stimulus-computable, perceptual models can capture many aspects of perception in theory, but can’t perform even the most straightforward real-world test.”
Parise’s model, described in eLife, overcomes this limitation. It is based on a neural computation first discovered in insects: their ability to detect movement is explained by the Hassenstein-Reichardt detector, which identifies motion by correlating signals across neighbouring receptive fields.
In previous research, Parise and his colleague Marc Ernst hypothesised that the same principle – correlation detection – could underlie how the brain fuses signals across different senses. They developed the multisensory correlation detector (MCD), showing that it could replicate human responses to simple audiovisual sequences, such as trains of flashes and clicks. They then refined the model to capture the fact that only transient information matters for audiovisual integration – in other words, the detector looks for changes in the input that correlate over time and space.
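For readers who want a concrete feel for the idea, the sketch below is a toy, single-unit illustration in Python of this delay-and-correlate scheme applied to audiovisual transients. It is not the published implementation: the filter choices, time constants and the names transient and mcd_unit are assumptions made purely for illustration.

```python
import numpy as np

def transient(signal, dt=0.001, tau=0.05):
    """Crude unimodal transient channel: high-pass the input so that only
    changes (onsets and offsets) survive, then rectify.
    Illustrative filter choice, not the published one."""
    lowpass = np.zeros_like(signal)
    for t in range(1, len(signal)):
        lowpass[t] = lowpass[t - 1] + dt / tau * (signal[t] - lowpass[t - 1])
    return np.abs(signal - lowpass)

def mcd_unit(visual, audio, dt=0.001, tau_fast=0.05, tau_slow=0.15):
    """Toy multisensory correlation detector: delay-and-correlate the
    transient traces of the two modalities, loosely mirroring the
    Hassenstein-Reichardt scheme across the senses."""
    def lp(x, tau):
        y = np.zeros_like(x)
        for t in range(1, len(x)):
            y[t] = y[t - 1] + dt / tau * (x[t] - y[t - 1])
        return y

    va, au = transient(visual, dt), transient(audio, dt)
    # Two mirror-symmetric sub-units, each multiplying a fast copy of one
    # modality with a slow (delayed) copy of the other.
    sub1 = lp(va, tau_fast) * lp(au, tau_slow)
    sub2 = lp(va, tau_slow) * lp(au, tau_fast)
    correlation = np.mean(sub1 * sub2)   # "do the two signals belong together?"
    lag = np.mean(sub1 - sub2)           # "which modality leads?"
    return correlation, lag

# Example: a flash at 0.50 s and a click at 0.52 s produce a strong
# correlation read-out because their transients nearly coincide.
t = np.arange(0, 1, 0.001)
flash = (np.abs(t - 0.50) < 0.010).astype(float)
click = (np.abs(t - 0.52) < 0.005).astype(float)
corr, lag = mcd_unit(flash, click)
```

The two read-outs loosely correspond to the questions the brain must answer: whether the sight and sound share a common cause, and which came first.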
The current study takes this further. By simulating a population of MCDs arranged in a lattice spanning visual and auditory space, Parise shows that this architecture can handle the complexity of real-life stimuli. The model reproduced the results of 69 classic experiments in humans, monkeys and rats, spanning spatial, temporal and attentional effects. “This represents the largest-scale simulation ever conducted in the field,” he says. “While other models have been tested extensively in the past, none have been tested against so many datasets in a single study.”
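Continuing the toy sketch above, the hypothetical mcd_lattice function below shows how a grid of such units, each watching one patch of the visual field but sharing the same soundtrack, could be read out as a coarse audiovisual saliency map. Again, the tiling, the luminance-based read-out and the parameter choices are illustrative assumptions, not the model as published.

```python
import numpy as np
# Reuses the toy mcd_unit defined in the earlier sketch.

def mcd_lattice(frames, audio_envelope, grid=(8, 8), dt=1 / 30):
    """Toy population read-out: tile the visual field into a coarse grid,
    run one correlation detector per tile against the shared audio track,
    and return the resulting map. Peaks mark locations whose visual
    transients co-vary with the sound - a crude audiovisual saliency map."""
    n_t, H, W = frames.shape          # grayscale video: (time, height, width)
    gy, gx = grid
    saliency = np.zeros(grid)
    for i in range(gy):
        for j in range(gx):
            patch = frames[:, i * H // gy:(i + 1) * H // gy,
                              j * W // gx:(j + 1) * W // gx]
            luminance = patch.mean(axis=(1, 2))   # one time course per tile
            corr, _ = mcd_unit(luminance, audio_envelope, dt=dt)
            saliency[i, j] = corr
    return saliency / (saliency.max() + 1e-12)
```

In this simplified picture, the tile whose visual changes best track the soundtrack wins, which is the intuition behind both the ventriloquist effect and the gaze predictions described below.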
The model matched behaviour almost perfectly across species, and outperformed the dominant Bayesian Causal Inference model while requiring the same number of free parameters. Additionally, the MCD lattice could predict where participants would look while watching audiovisual movies, acting as a lightweight “saliency model”.
Parise says that the model’s implications may reach beyond the field of neuroscience into artificial intelligence. “Evolution has already solved the problem of aligning sound and vision with simple, general-purpose computations that scale across species and contexts,” he says. “The crucial step here is stimulus computability: because the model works directly on raw audiovisual signals, it can be applied to any real-world material.”
He continues: “Today’s AI systems still struggle to combine multimodal information reliably, and audiovisual saliency models depend on large, parameter-heavy networks trained on vast labelled datasets. By contrast, the MCD lattice is lightweight, efficient, and requires no training. This makes the model a powerful candidate for next-generation applications.
“What began as a model of insect motion vision now explains how brains – human or otherwise – integrate sound and vision across an extraordinary range of contexts,” Parise concludes. “From predicting illusions like the McGurk and ventriloquist effects to inferring causality and generating dynamic audiovisual saliency maps, it offers a new blueprint for both neuroscience and artificial intelligence research.”
This study builds upon the previous work by Cesare Parise and Marc Ernst, titled ‘Multisensory integration operates on correlated input from unimodal transient channels’.
Accompanying multimedia, including video illustrations of the McGurk and ventriloquist illusions and audiovisual saliency maps, is available here.
Media contacts
Emily Packer, Media Relations Manager
eLife
+44 (0)1223 855373
About eLife
eLife transforms research communication to create a future where a diverse, global community of scientists and researchers produces open and trusted results for the benefit of all. Independent, not-for-profit and supported by funders, we improve the way science is practised and shared. In support of our goal, we introduced the eLife Model that ends the accept–reject decision after peer review. Instead, papers invited for review are published as Reviewed Preprints that contain public peer reviews and an eLife Assessment. We also continue to publish research that was accepted after peer review as part of our traditional process. eLife is supported by the Howard Hughes Medical Institute, Knut and Alice Wallenberg Foundation, the Max Planck Society and Wellcome. Learn more at https://elifesciences.org/about.
To read the latest Computational and Systems Biology research in eLife, visit https://elifesciences.org/subjects/computational-systems-biology.
And for the latest in Neuroscience, see https://elifesciences.org/subjects/neuroscience.
About the University of Liverpool
Founded in 1881 as the original ‘red brick’, the University of Liverpool is one of the UK’s leading research-intensive higher education institutions with an annual turnover of £708.3 million, including an annual research income of £163.1 million.
Now ranked in the top 150 universities worldwide (QS World Rankings 2026 and Times Higher Education World University Rankings 2026), we are a member of the prestigious Russell Group of the UK’s leading research universities and have a global reach and influence that reflects our academic heritage as one of the country’s largest civic institutions.
The latest UK rankings of circa 130 institutions have placed the University of Liverpool at 18th (Times and Sunday Times Good University Guide 2025), 22nd (2026 Guardian University Guide), 25th (Daily Mail University Guide 2025) and 23rd (2026 Complete University Guide) nationally.
Journal
eLife
Method of Research
Computational simulation/modeling
Article Title
Correlation detection as a stimulus computable account for audiovisual perception, causal inference, and saliency maps in mammals
Article Publication Date
4-Nov-2025