ChatGPT, arguably the most famous chatbot ever, learned its sometimes human-like conversational skills by parsing through absurd amounts of text data—millions of books, articles, Wikipedia pages, and everything else its creators could find by crawling around the Internet.
But what if an advanced AI could learn the way a little kid does, without reading 80 million books or looking at 97 million cats? Just taking its first baby steps, exploring an amazing new world under the patient guidance of mom and dad. A team of New York University researchers just gave it a shot, and it kind of worked.
Childhood memories
“The big thing this project speaks to is this classic debate on nurture versus nature. What is built into the child and what can be acquired through experience out in the world?” says Wai Keen Vong, a researcher at the NYU Center for Data Science. To find out, Vong and his team pushed an AI algorithm through the closest possible equivalent of early human childhood. They did this by feeding it a database called SAYCam-S, which is filled with first-person video footage taken by a camera strapped to a baby named Sam, recorded while Sam was doing usual baby things between the ages of six and 25 months.
“For our work we used a multimodal learning algorithm, which processed visual input (frames from the camera) and child-directed speech,” Vong explains. The algorithm, termed Child’s View for Contrastive Learning (CVCL), used a visual encoder and a language encoder to translate images and words into descriptive vectors. A neural network then analyzed those vectors to find patterns, eventually learning to associate the right images with the right words. (The algorithm itself was a generic multimodal learner, nothing revolutionary.)
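Under the hood, this is the same contrastive recipe popularized by image-text models like CLIP: embed each frame and the utterance heard alongside it, then pull matching pairs together in a shared vector space while pushing mismatched pairs apart. Here is a minimal PyTorch-style sketch of that idea; the encoder choices, temperature value, and other details are illustrative assumptions rather than the exact CVCL configuration.

```python
# Minimal sketch of contrastive image-text training in the spirit of CVCL,
# following the generic CLIP-style recipe. The encoders, temperature, and
# loss details here are illustrative assumptions, not the authors' exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveImageText(nn.Module):
    def __init__(self, vision_encoder, text_encoder):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a CNN mapping frames to vectors
        self.text_encoder = text_encoder      # e.g., an embedding/LSTM over utterances
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, frames, utterances):
        # Embed each frame and its accompanying utterance in a shared vector space.
        img = F.normalize(self.vision_encoder(frames), dim=-1)
        txt = F.normalize(self.text_encoder(utterances), dim=-1)

        # Similarity of every frame against every utterance in the batch.
        logits = img @ txt.t() / self.temperature

        # Contrastive objective: each frame should match its own utterance
        # (the diagonal) better than any other utterance in the batch, and vice versa.
        targets = torch.arange(img.size(0), device=img.device)
        loss_per_image = F.cross_entropy(logits, targets)
        loss_per_text = F.cross_entropy(logits.t(), targets)
        return (loss_per_image + loss_per_text) / 2
```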
Based on just 61 of Sam’s waking hours—roughly one percent of the child’s experience—the AI learned to recognize sand, paper, puzzles, cars, and balls in images. It performed on par with standard image recognition algorithms trained the usual way, on millions of examples. But it couldn’t figure out hands, rooms, or baskets. Some things simply didn’t click.
Imperfect slideshows
The problem was that the AI didn’t perceive Sam’s experiences the way Sam did. Because the algorithm only had access to individual frames annotated with transcribed speech, it saw them more like a very long slideshow than a continuous experience. “This caused learning artifacts,” says Vong.
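To picture that format, imagine each training example as one still frame tied to whatever transcribed utterance was spoken around the same moment, with the ordering between examples thrown away. A rough sketch of such a pairing, with hypothetical field names and a made-up time window, might look like this:

```python
# Illustrative sketch of the "slideshow" data format: still frames paired with
# whatever transcribed utterance was spoken around the same moment. Field names,
# the time window, and the pairing rule are assumptions, not the dataset's spec.
from dataclasses import dataclass

@dataclass
class FrameUtterancePair:
    frame_path: str     # a single still image sampled from the headcam video
    utterance: str      # transcribed child-directed speech heard near that frame
    timestamp_s: float  # when the frame was sampled, in seconds

def pair_frames_with_speech(frames, utterances, window_s=2.0):
    """Attach each transcribed utterance to the frames sampled within window_s
    seconds of it. Ordering between the resulting pairs is discarded downstream,
    which is why the model sees a slideshow rather than a continuous stream.
    """
    pairs = []
    for t_utt, text in utterances:        # utterances: list of (time_s, transcript)
        for t_frame, path in frames:      # frames: list of (time_s, image_path)
            if abs(t_frame - t_utt) <= window_s:
                pairs.append(FrameUtterancePair(path, text, t_frame))
    return pairs
```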
For example, it struggled with the word “hands.” Hands appeared in most of the frames, but the parents used the word most often when Sam was at the beach, so the AI confused “hands” with “sand,” Vong explains. The same thing happened with “room”: Sam spent most of his time indoors, yet his parents didn’t constantly remind him that he was in a room.
Then there was the issue of word frequency. Sam liked to play with balls, so he heard the word “ball” many times; the word “basket,” though, came up only rarely.
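Both failure modes boil down to raw co-occurrence statistics. A toy example (invented counts, not the study’s data) shows how skewed counts can steer a contrastive learner: a word mostly uttered in one kind of scene gets glued to that scene, and a word that is barely uttered at all gives the model almost nothing to learn from.

```python
# Toy illustration (invented counts, not the study's data) of how skewed
# co-occurrence statistics mislead a word-to-image learner: "hands" shows up
# mostly in beach scenes, and "basket" barely shows up at all.
from collections import Counter

# Hypothetical (word, scene) pairs extracted from annotated frames.
cooccurrences = [
    ("hands", "beach"), ("hands", "beach"), ("hands", "beach"), ("hands", "indoors"),
    ("ball", "living_room"), ("ball", "yard"), ("ball", "living_room"),
    ("ball", "living_room"), ("basket", "living_room"),
]

word_counts = Counter(word for word, _ in cooccurrences)
pair_counts = Counter(cooccurrences)

print(word_counts)  # "ball" dominates while "basket" appears once: weak signal
print(pair_counts)  # "hands" is tied to "beach", so the model links it to sand
```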
The AI also didn’t come to grips with the idea of movement. “The words associated with movement, like ‘push,’ ‘pull,’ ‘twist’—all the verbs have a temporal element to them,” Vong says. “This is something we are actively working on, learning from videos. We already know that using videos instead of still frames leads to a bit better understanding of things that unfold over time,” he adds. The next version should have learning from continuous experience sorted out.
Driving lessons
Obviously, teaching AIs to recognize balls in images has been done before. So why is the work of Vong’s team such a big deal that it landed in Science rather than some second-tier AI-specific publication? The answer is its potential to lay the groundwork for future advances.
It’s the first demonstration that AI can effectively learn from limited, individualized experience. It’s the difference between collecting a monstrous database of driving examples from hundreds of thousands of Teslas to teach an AI to drive a car and signing up a single Tesla for a few lessons with a driving instructor. The latter is simpler, faster, and infinitely cheaper.
We’re still far away from teaching machines the way we teach humans. “The model we used was passive; it was not designed to produce actions or provide any responses on its own,” says Vong.
Still, even this system has many avenues for improvement: using a database covering more than one percent of the kid’s time, or adding information beyond text and images, with sound, smell, touch, emotional context, and so on as potential candidates. “But all this can be done by expanding the AI we already have and not starting from scratch,” Vong claims.
Which suggests we’re way less special than we thought. “Be it driving or language learning, humans are just way more sample-efficient than AIs. A big part of our work is to figure out what makes us so sample-efficient and how to use that to build smarter machines,” says Vong.
Jacek Krywko is a science and technology writer based in Olsztyn, Poland. He covers space exploration and artificial intelligence research.