Google Gemini is just 6 months old, but it has already shown impressive capabilities across security, coding, debugging and other areas (of course, it has exhibited serious limitations, too).
Now, the large language model (LLM) is outperforming humans when it comes to sleep and fitness advice.
Researchers at Google have introduced the Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned to understand and reason on time-series personal health data from wearables such as smartwatches and heart rate monitors. In their experiments, the model answered questions and made predictions noticeably better than experts with years of experience in the health and fitness fields.
“Our work…employs generative AI to expand model utility from only predicting health states to also providing coherent, contextual and potentially prescriptive outputs that depend on complex health behaviors,” the researchers write.
Gemini as a sleep and fitness expert
Wearable technology can help people monitor and, ideally, make meaningful changes to their health. These devices provide a “rich and longitudinal source of data” for personal health monitoring that is “passively and continuously acquired” from inputs including exercise and diet logs, mood journals and sometimes even social media activity, the Google researchers point out.
However, the data they capture around sleep, physical activity, cardiometabolic health and stress is rarely incorporated into clinical settings that are “sporadic in nature.” Most likely, the researchers posit, this is because data is captured without context and requires a lot of computation to store and analyze. Further, it can be difficult to interpret.
Also, while LLMs have done well when it comes to medical question-answering, analysis of electronic health records, diagnosis based on medical images and psychiatric evaluations, they often lack the ability to reason about and make recommendations on data from wearables.
However, the Google researchers made a breakthrough in training PH-LLM to make recommendations, answer professional examination questions and predict self-reported sleep disruption and sleep impairment outcomes. The model was given multiple-choice questions, and researchers also applied chain-of-thought prompting (asking the model to reason step by step, as a human would) and zero-shot prompting (asking the model to answer without first showing it worked examples).
Impressively, PH-LLM achieved 79% in the sleep exam and 88% in the fitness exam. Both scores exceeded the averages from a sample of human experts, which included five professional athletic trainers (with an average of 13.8 years of experience) and five sleep medicine experts (with an average of 25 years of experience). The humans achieved an average score of 71% in fitness and 76% in sleep.
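To make the comparison concrete, the short sketch below computes the percentage-point gap between PH-LLM and the human expert average in each exam. The scores are the ones reported above; the dictionary layout and the `margin` helper are illustrative assumptions, not anything taken from the researchers' work.

```python
# Exam accuracies (percent) as reported in the article.
# The data layout and helper name are illustrative assumptions.
scores = {
    "sleep": {"ph_llm": 79, "human_experts": 76},
    "fitness": {"ph_llm": 88, "human_experts": 71},
}


def margin(domain: str) -> int:
    """Percentage-point gap between PH-LLM and the human expert average."""
    s = scores[domain]
    return s["ph_llm"] - s["human_experts"]


margins = {domain: margin(domain) for domain in scores}
print(margins)  # {'sleep': 3, 'fitness': 17}
```

The gap is much wider in fitness (17 points) than in sleep (3 points), so the sleep result is a narrower win than the headline numbers might suggest.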
In one coaching recommendation example, researchers prompted the model: “You are a sleep medicine expert. You are given the following sleep data. The user is male, 50 years old. List the most important insights.”
PH-LLM replied: “They are having trouble falling asleep…adequate deep sleep [is] important for physical recovery.” The model further advised: “Make sure your bedroom is cool and dark…avoid naps and keep a consistent sleep schedule.”
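As a rough illustration of how a coaching prompt like this might be assembled, the sketch below combines the quoted instruction with user demographics and a block of sleep metrics. The article does not describe how the researchers actually formatted the wearable data, so the `build_sleep_prompt` function, the metric names and the sample values are all hypothetical, and no model is called.

```python
# Hypothetical sketch: the function name, metric names and values are
# assumptions for illustration. Only the instruction wording is quoted
# from the article.
def build_sleep_prompt(sex: str, age: int, sleep_data: dict[str, float]) -> str:
    """Combine the quoted instruction with demographics and sleep metrics."""
    data_lines = "\n".join(f"{name}: {value}" for name, value in sleep_data.items())
    return (
        "You are a sleep medicine expert. "
        "You are given the following sleep data. "
        f"The user is {sex}, {age} years old. "
        "List the most important insights.\n"
        f"{data_lines}"
    )


# Made-up sample values, not data from the study.
sample_data = {"sleep_score": 72, "awake_minutes": 48, "rem_percent": 18.5}
prompt = build_sleep_prompt("male", 50, sample_data)
print(prompt)
```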
Meanwhile, when asked what type of muscular contraction occurs in the pectoralis major “during the slow, controlled, downward phase of a bench press” and given four answer choices, PH-LLM correctly responded “eccentric.”
For patient-reported outcomes, researchers asked the model: “Based on this wearable data, would the user report having difficulty falling asleep?” It replied: “This person is likely to report that they experience difficulty falling asleep several times over the past month.”
The researchers note: “Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge base and capabilities of Gemini models.”
Gemini can offer personalized insights
To achieve these results, the researchers first created and curated three datasets that tested personalized insights and recommendations from captured physical activity, sleep patterns and physiological responses; expert domain knowledge; and predictions around self-reported sleep quality.
In collaboration with domain experts, they created 857 case studies representing real-world scenarios around sleep and fitness (507 for the former and 350 for the latter). Sleep scenarios used individual metrics to identify potential causal factors and provide personalized recommendations to help improve sleep quality. Fitness tasks used information from training, sleep, health metrics and user feedback to create recommendations for intensity of physical activity on a given day.
Both categories of case studies incorporated wearable sensor data — for up to 29 days for sleep and over 30 days for fitness — as well as demographic information (age and gender) and expert analysis.
Sensor data included overall sleep scores, resting heart rates and changes in heart rate variability, sleep duration (start and end time), awake minutes, restlessness, percentage of REM sleep time, respiratory rates, number of steps and fat burning minutes.
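One way to picture a single day of this sensor data is as a typed record. The sketch below is an assumption for illustration: the field names follow the metrics listed above, but the class, the units and the sample values are hypothetical and do not reflect the researchers' actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


# Hypothetical record: field names mirror the metrics listed in the
# article, but the types and units are assumptions.
@dataclass
class DailyReading:
    sleep_score: int
    resting_heart_rate: float
    hrv_change: float
    sleep_start: datetime
    sleep_end: datetime
    awake_minutes: int
    restlessness: float
    rem_percent: float
    respiratory_rate: float
    steps: int
    fat_burning_minutes: int

    def sleep_duration_hours(self) -> float:
        """Derive sleep duration from the recorded start and end times."""
        return (self.sleep_end - self.sleep_start).total_seconds() / 3600


# Made-up sample values, not data from the study.
reading = DailyReading(
    sleep_score=72,
    resting_heart_rate=61.0,
    hrv_change=-4.2,
    sleep_start=datetime(2024, 1, 1, 23, 0),
    sleep_end=datetime(2024, 1, 2, 6, 30),
    awake_minutes=48,
    restlessness=0.12,
    rem_percent=18.5,
    respiratory_rate=14.8,
    steps=8200,
    fat_burning_minutes=35,
)
print(reading.sleep_duration_hours())  # 7.5
```

A case study would then hold a list of such records (up to 29 days for sleep), alongside the user's age and gender.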
“Our study shows that PH-LLM is capable of integrating passively-acquired objective data from wearable devices into personalized insights, potential causes for observed behaviors and recommendations to improve sleep hygiene and fitness outcomes,” the researchers write.
Still much work to be done in personal health apps
Still, the researchers acknowledge, PH-LLM is just the start, and like any emerging technology, it has bugs to be worked out. For instance, model-generated responses were not always consistent, there were “conspicuous differences” in confabulations across case studies and the LLM was sometimes conservative or cautious in its responses.
In fitness case studies, the model was sensitive to over-training, and, in one instance, human experts noted its failure to identify under-sleeping as a potential cause of harm. Also, although case studies were sampled broadly across demographics, they came from relatively active individuals, so they likely weren’t fully representative of the population and couldn’t address broader sleep and fitness concerns.
“We caution that much work remains to be done to ensure LLMs are reliable, safe and equitable in personal health applications,” the researchers write. This includes further reducing confabulations, considering unique health circumstances not captured by sensor information and ensuring training data reflects the diverse population.
All told, though, the researchers note: “The results from this study represent an important step toward LLMs that deliver personalized information and recommendations that support individuals to achieve their health goals.”