A system developed by Google’s DeepMind has set a new record for AI performance on geometry problems. DeepMind’s AlphaGeometry managed to solve 25 of the 30 geometry problems drawn from the International Mathematical Olympiad between 2000 and 2022.
That puts the software ahead of the vast majority of young mathematicians and just shy of IMO gold medalists. DeepMind estimates that the average gold medalist would have solved 26 out of 30 problems. Many view the IMO as the world’s most prestigious math competition for high school students.
“Because language models excel at identifying general patterns and relationships in data, they can quickly predict potentially useful constructs, but often lack the ability to reason rigorously or explain their decisions,” DeepMind writes. To overcome this difficulty, DeepMind paired a language model with a more traditional symbolic deduction engine that performs algebraic and geometric reasoning.
The research was led by Trieu Trinh, a computer scientist who recently earned his PhD from New York University. He was a resident at DeepMind between 2021 and 2023.
Evan Chen, a former Olympiad gold medalist who evaluated some of AlphaGeometry’s output, praised it as “impressive because it’s both verifiable and clean.” Whereas some earlier software generated complex geometry proofs that were hard for human reviewers to understand, the output of AlphaGeometry is similar to what a human mathematician would write.
AlphaGeometry is part of DeepMind’s larger project to improve the reasoning capabilities of large language models by combining them with traditional search algorithms. DeepMind has published several papers in this area over the last year.
How AlphaGeometry works
Let’s start with a simple example shown in the AlphaGeometry paper, which was published by Nature on Wednesday:
The goal is to prove that if a triangle has two equal sides (AB and AC), then the angles opposite those sides will also be equal. We can do this by creating a new point D at the midpoint of the third side of the triangle (BC). It’s easy to show that all three sides of triangle ABD are the same length as the corresponding sides of triangle ACD. And two triangles with equal sides always have equal angles.
Geometry problems from the IMO are much more complex than this toy problem, but fundamentally, they have the same structure. They all start with a geometric figure and some facts about the figure like “side AB is the same length as side AC.” The goal is to generate a sequence of valid inferences that conclude with a given statement like “angle ABC is equal to angle BCA.”
For many years, we’ve had software that can generate lists of valid conclusions that can be drawn from a set of starting assumptions. Simple geometry problems can be solved by “brute force”: mechanically listing every possible fact that can be inferred from the given assumption, then listing every possible inference from those facts, and so on until you reach the desired conclusion.
But this kind of brute-force search isn’t feasible for an IMO-level geometry problem because the search space is too large. Not only do harder problems require longer proofs, but sophisticated proofs often require the introduction of new elements to the initial figure—as with point D in the above proof. Once you allow for these kinds of “auxiliary points,” the space of possible proofs explodes and brute-force methods become impractical.
So, mathematicians must develop an intuition about which proof steps will likely lead to a successful result. DeepMind’s breakthrough was to use a language model to provide the same kind of intuitive guidance to an automated search process.
The downside to a language model is that it is not great at deductive reasoning—language models can sometimes “hallucinate” and reach conclusions that don’t actually follow from the given premises. So, the DeepMind team developed a hybrid architecture. There’s a symbolic deduction engine that mechanically derives conclusions that logically follow from the given premises. But periodically, control will pass to a language model that will take a more “creative” step, like adding a new point to the figure.
What makes this tricky is that it takes a lot of data to train a new language model, and there aren’t nearly enough examples of difficult geometry problems. So, instead of relying on human-designed geometry problems, Trinh and his DeepMind colleagues generated a huge database of challenging geometry problems from scratch.
To do this, the software would generate a series of random geometric figures like those illustrated above. Each had a set of starting assumptions. The symbolic deduction engine would generate a list of facts that follow logically from the starting assumptions, then more claims that follow from those deductions, and so forth. Once there was a long enough list, the software would pick one of the conclusions and “work backwards” to find the minimum set of logical steps required to reach the conclusion. This list of inferences is a proof of the conclusion, and so it can become a problem in the training set.
Sometimes a proof would reference a point in the figure, but the proof didn’t depend on any initial assumptions about that point. In those cases, the software could remove that point from the problem statement but then introduce the point as part of the proof. In other words, it could treat this point as an “auxiliary point” that needed to be introduced to complete the proof. These examples helped the language model to learn when and how it was helpful to add new points to complete a proof.
In total, DeepMind generated 100 million synthetic geometry proofs, including almost 10 million that required introducing “auxiliary points” as part of the solution. During the training process, DeepMind placed extra emphasis on examples involving auxiliary points to encourage the model to take these more creative steps when solving real problems.