A way to let robots learn by listening will make them more useful


Researchers at the Robotics and Embodied AI Lab at Stanford University set out to change that. They first built a system for collecting audio data, consisting of a gripper with a microphone designed to filter out background noise, and a GoPro camera. Human demonstrators used the gripper for a variety of household tasks; the team then used this data to train robotic arms to execute the tasks on their own. The team's new training algorithms help robots gather clues from audio signals to perform more effectively.

“Thus far, robots have been training on videos that are muted,” says Zeyi Liu, a PhD student at Stanford and lead author of the study. “But there is so much helpful data in audio.”

To test how much more successful a robot can be if it's capable of "listening," the researchers chose four tasks: flipping a bagel in a pan, erasing a whiteboard, putting two Velcro strips together, and pouring dice out of a cup. In each task, sound provides clues that cameras or tactile sensors struggle to capture, like whether the eraser is properly contacting the whiteboard, or whether the cup contains any dice.

After demonstrating each task a couple hundred times, the team compared the success rates of training with audio against training with vision alone. The results, published in an arXiv preprint that has not yet been peer-reviewed, were promising. Using vision alone in the dice test, the robot could tell whether there were dice in the cup only 27% of the time; that rose to 94% when sound was included.

It isn’t the first time audio has been used to train robots, Liu says, but it’s a big step toward doing so at scale. “We are making it easier to use audio collected ‘in the wild,’ rather than being restricted to collecting it in the lab, which is more time-consuming.” 

The research signals that audio might become a more sought-after data source in the race to train robots with AI. Researchers are teaching robots more quickly than ever before using imitation learning, showing them hundreds of examples of tasks being done instead of hand-coding each task. If audio could be collected at scale using devices like the one in the study, it could provide an entirely new "sense" for robots, helping them adapt more quickly to environments where visibility is limited or not useful.

“It’s safe to say that audio is the most understudied modality for sensing” in robots, says Dmitry Berenson, associate professor of robotics at the University of Michigan, who was not involved in the study. That’s because the bulk of robotics research on manipulating objects has been for industrial pick-and-place tasks, like sorting objects into bins. Those tasks don’t benefit much from sound, instead relying on tactile or visual sensors. But, as robots broaden into tasks in homes, kitchens, and other environments, audio will become increasingly useful, Berenson says.

Consider a robot trying to find which bag contains a set of keys, all with limited visibility. “Maybe even before you touch the keys, you hear them kind of jangling,” Berenson says. “That’s a cue that the keys are in that pocket, instead of others.”

Still, audio has limits. The team points out that sound won't be as useful with so-called soft or flexible objects like clothes, which don't create as much usable audio. The robots also struggled to filter out the noise of their own motors during tasks, since that noise was not present in the training data produced by humans. To fix this, the researchers needed to add robot sounds (whirs, hums, and actuator noises) into the training sets so the robots could learn to tune them out.
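That augmentation step can be sketched in a few lines. The snippet below is a hypothetical illustration (not the study's actual code): it overlays a recorded clip of motor noise onto a human demonstration's audio at a chosen signal-to-noise ratio, so a model trained on the mixed clips learns to ignore the robot's own sounds.

```python
import numpy as np

def mix_in_robot_noise(demo_audio: np.ndarray,
                       robot_noise: np.ndarray,
                       snr_db: float = 10.0) -> np.ndarray:
    """Overlay robot motor noise onto a human demonstration recording
    at a target signal-to-noise ratio (in decibels)."""
    # Tile or trim the noise clip to match the demonstration length.
    reps = int(np.ceil(len(demo_audio) / len(robot_noise)))
    noise = np.tile(robot_noise, reps)[:len(demo_audio)]

    # Scale the noise so the mixture hits the requested SNR.
    signal_power = np.mean(demo_audio ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return demo_audio + scale * noise

# Example with synthetic stand-in signals: a 440 Hz "task sound"
# and a 120 Hz "motor hum," both sampled at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
demo = np.sin(2 * np.pi * 440 * t)
hum = 0.3 * np.sin(2 * np.pi * 120 * t[: sr // 4])
augmented = mix_in_robot_noise(demo, hum, snr_db=10.0)
```

The function names and the 10 dB target here are assumptions for illustration; in practice the noise clips would come from recordings of the robot's actuators, mixed in at varied levels.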

The next step, Liu says, is to see how much better the models can get with more data, which could mean more microphones, collecting spatial audio, and adding microphones to other types of data-collection devices. 
