Discover how companies are responsibly integrating AI in production. This invite-only event in SF will explore the intersection of technology and business. Find out how you can attend here.
In the age of generative AI, the safety of large language models (LLMs) is just as important as their performance at different tasks. Many teams already realize this and are pushing the bar on their testing and evaluation efforts to foresee and fix issues that could lead to broken user experiences, lost opportunities and even regulatory fines.
But, when models are evolving so quickly in both open and closed-source domains, how does one determine which LLM is the safest to begin with? Well, Enkrypt has the answer: a LLM Safety Leaderboard. The Boston-based startup, known for offering a control layer for the safe use of generative AI, has ranked LLMs from best to worst, based on their vulnerability to different safety and reliability risks.
The leaderboard covers dozens of top-performing language models, including the GPT and Claude families. More importantly, it provides some interesting insights into risk factors that might be critical in choosing a safe and reliable LLM and implementing measures to get the best out of them.
Understanding Enkrypt’s LLM Safety Leaderboard
When an enterprise uses a large language model in an application (like a chatbot), it runs constant internal tests to check for safety risks like jailbreaks and biased outputs. Even a tiny error in this approach could leak personal information or return biased output, like what happened with Google’s Gemini chatbot. The impact could be even bigger in regulated industries like fintech or healthcare.
Founded in 2023, Enkrypt has been streamlining this problem for enterprises with Sentry, a comprehensive solution that identifies vulnerabilities in gen AI apps and deploys automated guardrails to block them. Now, as the next step in this work, the company is extending its red teaming offering with the LLM Safety Leaderboard that provides insights to help teams begin with the safest model in the first place.
The offering, developed after rigorous tests across diverse scenarios and datasets, provides a comprehensive risk score for as many as 36 open and closed-source LLMs. It considers multiple safety and security metrics, including the model’s ability to avoid generating harmful, biased or inappropriate content and its potential to block out malware or prompt injection attacks.
Who wins the safest LLM award?
As of May 8, Enkrypt’s leaderboard presents OpenAI’s GPT-4-Turbo as the winner with the lowest risk score of 15.23. The model defends jailbreak attacks very effectively and provides toxic outputs just 0.86% of the time. However, issues of bias and malware did affect the model 38.27% and 21.78% of the time.
The next best on the list is Meta’s Llama2 and Llama 3 family of models, with risk scores ranging between 23.09 and 35.69. Anthropic’s Claude 3 Haiku also sits 10th on the leaderboard with a risk score of 34.83. According to Enkrypt, it does decently across all tests, except for bias, where it provided unfair answers over 90% of the time.
Notably, the last on the leaderboard are Saul Instruct-V1 and Microsoft’s recently announced Phi3-Mini-4K models with risk scores of 60.44 and 54.16, respectively. Mixtral 8X22B and Snowflake Arctic also rank low – 28 and 27 – in the list.
However, it is important to note that this list will change as the existing models improve and new ones come to the scene over time. Enkrypt plans to update the leaderboard regularly to show the changes.
“We are updating the leaderboard on Day Zero with most new model launches. For model updates, the leaderboard will be updated on a weekly basis. As AI safety research evolves and new techniques are developed, the leaderboard will provide regular updates to reflect the latest findings and technologies. This ensures that the leaderboard remains a relevant and authoritative resource,” Sahi Agarwal, the co-founder of Enkrypt, told VentureBeat.
Eventually, Agarwal hopes this evolving list will give enterprise teams a way to delve into the strengths and weaknesses of each popular LLM – whether it’s avoiding bias or blocking prompt injection – and use that to decide on what would work best for their targeted use case.
“Integrating our leaderboard into AI strategy not only boosts technological capabilities but also upholds ethical standards, offering a competitive edge and building trust. The risk/safety/governance team within an enterprise would use the Leaderboard to provision which models are safe to use by the product and engineering teams. Currently, they do not have this level of information from a safety perspective – only public performance benchmark numbers. The leaderboard and red team assessment reports guide them with safety recommendations for the models when deployed,” he added.