Calvin Wankhede / Android Authority

TL;DR

  • Google released a hands-on video demonstrating the voice response capabilities of Gemini in “real-time.”
  • Google later admitted that the video demo didn’t actually happen in real-time with spoken prompts.
  • A YouTuber used GPT-4 Vision to recreate the Gemini demo and do it in real time.

After Google released its impressive Gemini hands-on demo video, it was discovered to be a little too good to be true. But now someone has recreated that demo using GPT-4 Vision, accomplishing what Gemini couldn’t do in its video.

Google’s Gemini large language model (LLM) is the company’s most powerful suite of AI models to date, and its biggest shot at OpenAI’s GPT-4 architecture. In an attempt to show off just how capable its multimodal LLM is, Google released a hands-on video of Gemini supposedly responding to voice prompts in real time. Initially, the demo was pretty impressive, but viewers eventually discovered a disclaimer that said latency was reduced and Gemini’s outputs were shortened for brevity.

While those issues make the demo a little less impressive, it was the realization that Gemini wasn’t actually responding to speech in real time, as Google implied, that turned it into a real egg-on-the-face moment for the company. Google admitted to Bloomberg that Gemini wasn’t responding to voice prompts in real time, but was instead responding to text prompts. To address the criticism, Gemini co-lead Oriol Vinyals later explained that Gemini has all the capabilities needed for this function, but that the video was meant to show what “multimodal user experiences built with Gemini could look like.”

While the damage has been done, it looks like a YouTuber has added insult to injury. The YouTube channel Greg Technology has published a video recreating Gemini’s demo with GPT-4 Vision. Unlike Google’s hands-on video, this one was actually done in real time with voice prompts.

In the video, GPT-4 is asked to recognize hand signs, identify a game the host is playing with his hands, and describe a drawing. While not as polished or as quick as what was shown in the Gemini demo, it responds in real time.
