TL;DR
- Google recently launched and demoed Gemini, its latest large language model.
- However, Google’s demo of Gemini isn’t in real-time and takes liberties in a few of the demo sequences.
- In reality, Gemini was prompted with still images and written text, and it responded with written text.
Google recently launched Gemini, its latest large language model, to the public. Gemini competes against the likes of OpenAI’s GPT-4 and will power much of Google’s AI smarts in the years to come. Google had a fantastic hands-on demo to showcase Gemini’s capabilities, and it was pretty impressive how seamless the AI model appeared to be. However, that is only part of the story, as it has now emerged that the demo wasn’t precisely a real-time demo of Gemini.
First, let’s take a look at Google’s Gemini hands-on video:
Pretty impressive, right? Gemini could effortlessly and seamlessly comprehend spoken language and images, even when the image changed dynamically (see the duck getting colored in). Gemini was so responsive that it did not feel like an AI interaction; it could have been a person!
As it turns out, part of the video isn’t real: the AI interaction does not happen the way Google’s video makes it appear. As Bloomberg points out, the YouTube description of the video includes the following disclaimer:
For the purposes of this demo, latency has been reduced and Gemini outputs have been shortened for brevity.
While this indicates that the AI model would have taken longer to answer, Bloomberg notes that the demo was neither carried out in real-time nor with spoken voice. A Google spokesperson said it was made by “using still image frames from the footage, and prompting via text.”
In practice, the way Gemini works is much more AI-like than the demo makes it out to be. Google’s Vice President of Research and co-lead for Gemini demonstrated Gemini’s actual workings.
Really happy to see the interest around our “Hands-on with Gemini” video. In our developer blog yesterday, we broke down how Gemini was used to create it. https://t.co/50gjMkaVc0 We gave Gemini sequences of different modalities — image and text in this case — and had it answer… pic.twitter.com/Beba5M5dHP
The second video shows that Gemini starts with an initial instruction set that draws its attention to the sequence of objects in the image. A still image is then fed to Gemini alongside a text prompt, and when the model is run, Gemini takes about four to five seconds to produce a text response.
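In other words, the interaction boils down to an ordinary image-plus-text prompt rather than a live video-and-voice exchange. Below is a minimal sketch of what such a request looks like, assuming the public google-generativeai Python SDK and the gemini-pro-vision model; the file name and prompt text are made up for illustration, and this is not the exact pipeline Google used to produce the video.

```python
# Illustrative sketch only: the model name, SDK, file name, and prompt are
# assumptions for this example, not the exact setup behind Google's demo.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")

# A multimodal Gemini model that accepts image + text and returns text.
model = genai.GenerativeModel("gemini-pro-vision")

# A single still frame (not live video) plus a written prompt,
# mirroring how the hands-on video was actually produced.
frame = PIL.Image.open("duck_drawing_frame.png")
prompt = "What am I drawing? Describe it as the drawing progresses."

response = model.generate_content([prompt, frame])
print(response.text)  # the text reply arrives after a few seconds
```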
The company never mentioned that this was a live demo and even had a disclaimer in place for latency and brevity. But still, it’s clear that Google took creative liberties with the demo.
Companies edit their demos more often than you think they do, and live audience demos are the only ones that you should take at face value. But one can argue that Google’s demo for Gemini was a bit too creative and not an accurate representation of how Gemini works.
It’s quite similar to how phone OEMs show camera samples and “Shot on” photos and videos on stage, only for it to emerge later that additional equipment and talent were involved in getting those results. The results the average user would get are quite different, and most of us have learned to ignore camera samples, especially ones the company itself presents.