
Learning a scene that's unseen, without human help

American Association for the Advancement of Science

Scientists at Google DeepMind have developed a machine-learning system that can "observe" a scene from multiple angles and predict what the same space would look like from an entirely different view - one not encountered during training. The system learns the 3-D composition of an environment, the authors report, using only a small number of 2-D sample images of the scene - and, critically, without human supervision. Called the Generative Query Network (GQN), it could pave the way toward machines that learn about the world autonomously using their own sensors, without the need for training on datasets labeled by humans, a requirement of current computer vision systems.
Built by Seyed Mohammadali Eslami and colleagues, the GQN consists of two parts: a representation network, which develops an encoded representation of the scene from the sample images, and a generation network, which outputs probable images of the scene from new viewpoints, accounting for uncertainty when parts of the scene are obscured. Eslami and colleagues trained the GQN using simple computer-generated environments containing various objects and lighting setups. Given several images of a new scene, it was then able to generate predicted images of that scene from any viewpoint within it.
The network's representations are "factorized," meaning properties like color, shape and size are learned and used separately. The researchers were able to construct new scenes by adding and subtracting the GQN's representations: subtracting a scene containing a red sphere from a scene with a blue sphere and adding one with a red cylinder results in a scene with a blue cylinder, all without a human explicitly teaching the GQN the notions of color or shape.
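The scene arithmetic described above can be illustrated with a toy sketch. This is not the GQN itself: the vectors below are hand-built one-hot codes standing in for representations that the real network learns from images, and the names `encode`, `decode` and `combine` are hypothetical helpers chosen for illustration. The sketch only shows why a factorized code, one that keeps color and shape in separate slots, makes "blue sphere minus red sphere plus red cylinder" come out as a blue cylinder.

```python
# Toy illustration only: hand-built one-hot codes stand in for the
# representations that a GQN would learn. All names here are hypothetical.
COLORS = ["red", "blue", "green"]
SHAPES = ["sphere", "cylinder", "cube"]


def encode(color, shape):
    """Return a factorized vector: color slots first, then shape slots."""
    vec = [0.0] * (len(COLORS) + len(SHAPES))
    vec[COLORS.index(color)] = 1.0
    vec[len(COLORS) + SHAPES.index(shape)] = 1.0
    return vec


def decode(vec):
    """Read each factor back by taking the largest entry in its slot range."""
    color_part = vec[:len(COLORS)]
    shape_part = vec[len(COLORS):]
    color = COLORS[color_part.index(max(color_part))]
    shape = SHAPES[shape_part.index(max(shape_part))]
    return color, shape


def combine(a, b, c):
    """Compute a - b + c element by element."""
    return [x - y + z for x, y, z in zip(a, b, c)]


# blue sphere - red sphere + red cylinder
result = combine(encode("blue", "sphere"),
                 encode("red", "sphere"),
                 encode("red", "cylinder"))
print(decode(result))  # ('blue', 'cylinder')
```

Because color and shape occupy separate slots, the subtraction cancels "sphere" and swaps "red" for "blue" without disturbing the shape added afterward. In the GQN, as the release notes, this separation is learned rather than specified by hand.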
The network also shows promise as a way to control robotic devices; after training, its predictive abilities allow it to "observe" robotic arms, for example, from different angles using only one stationary camera, meaning less raw data is needed for accurate positioning and control. A related perspective by Matthias Zwicker comments on these findings.

###

Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.