News Release

AI system reveals new insights into early language acquisition through the experience of a single child

Peer-Reviewed Publication

American Association for the Advancement of Science (AAAS)

A new machine learning model – trained on video and audio recorded from the first-person perspective of one young child for over a year – has provided new insights into early language acquisition. Not only do the findings offer a valuable framework for understanding how children learn words and concepts, but they could be critical in developing artificial intelligence (AI) systems that learn language in more human-like ways. Beginning around 6 to 9 months of age, children acquire their first words, connecting spoken words to real-world objects and concepts. By 1.5 to 2 years of age, most children can comprehend an average of 300 words. However, how children acquire their first words, and how those words become grounded in their visual counterparts, is poorly understood. Although the topic is widely debated and several hypotheses have been proposed, early language acquisition has traditionally been examined in laboratory settings, yielding findings that may not generalize to real-world contexts. A better understanding of this process in children could inform next-generation multimodal AI systems that develop links between words and visual representations.
Here, Wai Keen Vong and colleagues address these questions with a novel approach: the Child’s View for Contrastive Learning model (CVCL). Using longitudinal head-mounted camera recordings of a single child’s first-person experience over a 1.5-year period (ages 6 to 25 months), Vong et al. trained CVCL – a relatively generic neural network – on video frames (representing what the child was seeing) paired with the child-directed utterances that co-occurred with them (what the child was hearing). Through this, the authors show that the model could learn the word-referent mappings present in the child’s everyday experience. Even though it was trained on only a subset of the child’s actual naturalistic experience, the model was able to generalize beyond the specific visual objects seen in the child’s environment during training and to align its visual and linguistic representations of them. According to Vong et al., the model – with limited sensory input and relatively generic learning mechanisms – provides a computational foundation for investigating how children acquire their first words and how those words become grounded in the visual world. Despite these conclusions, the authors highlight several limitations that keep the model from fully explaining word learning in children.
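The training setup described above pairs what the child sees with what the child hears and pulls matching frame and utterance representations together. For readers curious about the mechanics, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) loss of the general kind such models use – an illustration only, not the authors’ actual implementation; the function name, embedding shapes, and temperature value are all illustrative assumptions.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of co-occurring
    (video frame, utterance) embedding pairs.

    image_emb, text_emb: arrays of shape (batch, dim), where row i of
    each array comes from the same moment of experience. Matched pairs
    sit on the diagonal of the similarity matrix; the loss rewards the
    model for ranking each frame's own utterance above all others in
    the batch, and vice versa. (Illustrative sketch, not CVCL itself.)
    """
    # Normalize embeddings to unit length so dot products are cosines
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarities, sharpened by the temperature
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Softmax cross-entropy where the "correct" class for row i is column i
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned pairs the loss approaches zero; with mismatched pairs it grows large, which is the gradient signal that gradually grounds words in their visual referents.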

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.