These geometry problems are composed from templates in our relation library, corresponding to easy, medium, and hard difficulty levels, respectively. For visual clarity, the colors, font sizes, and line thicknesses in this figure have been modified relative to the original images in our constructed dataset; please refer to the original dataset for precise details.
Multimodal large language models have many practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Moreover, most template-based data synthesis pipelines fail to generalize to questions outside their predefined templates. In this paper, we mitigate this issue by introducing a complementary RLHF process into the data generation pipeline. By adopting RAFT to adjust captions for image-text pairs generated from fewer than 50 templates, and by using reward signals derived from downstream mathematical problem-solving tasks, our pipeline captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. The generated dataset also enhances the general mathematical reasoning capabilities of multimodal large language models beyond geometric problems, yielding accuracy improvements of 2.8%–5.3% on arithmetic, algebraic, and numerical tasks, even with non-geometric input images.
We train base models on diverse mathematical datasets, covering both captioning (AutoGeo, GeoPeP, and our proposed GeoReasoning) and reasoning (GeoGPT4V, Geo170K) datasets, as well as on subsets of these datasets of various sizes. We then evaluate the trained models on the commonly used mathematical benchmarks MathVista and MathVerse.
Result 1: The model trained on GeoReasoning outperforms those trained on other datasets, indicating the high quality of GeoReasoning and its superior ability to improve reasoning capacity.
Result 2: Models trained on various sizes of GeoReasoning exhibit clear scaling behavior.
We train base models on GeoReasoning and evaluate them on MMMU to test generalization.
Result 1: The model trained on GeoReasoning outperforms the base model in domains such as art, science, and engineering, indicating that GeoReasoning can boost reasoning capacity across diverse domains.
Result 2: The model trained on GeoReasoning outperforms those trained on other datasets, revealing GeoReasoning's stronger generalization.
We first pre-define a relation library of approximately 50 basic relations, together with a corresponding clause library.
We then randomly sample relations and their corresponding clauses, and check their compatibility, as sketched below.
In this way we can generate geometry problems of arbitrary complexity and difficulty.
Since generation is fast, we can efficiently produce a large volume of samples at scale.
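To make the sampling loop concrete, here is a minimal Python sketch of the idea; the relation names, clause templates, and the `compatible` check are hypothetical toy stand-ins for the actual relation and clause libraries.

```python
import random
import string

# Hypothetical miniature relation library: each relation maps to a clause
# template (the real library contains roughly 50 such relations).
RELATION_LIBRARY = {
    "perpendicular": "line {a} is perpendicular to line {b}",
    "parallel": "line {a} is parallel to line {b}",
    "midpoint": "point {a} is the midpoint of segment {b}",
}

# Toy compatibility rule: the same pair of lines cannot be
# simultaneously parallel and perpendicular.
CONFLICTS = {frozenset({"parallel", "perpendicular"})}

def compatible(chosen, candidate):
    return all(frozenset({c, candidate}) not in CONFLICTS for c in chosen)

def sample_problem(num_relations):
    """Randomly sample mutually compatible relations and render their clauses."""
    chosen = []
    while len(chosen) < num_relations:
        candidate = random.choice(list(RELATION_LIBRARY))
        if candidate not in chosen and compatible(chosen, candidate):
            chosen.append(candidate)
    names = iter(string.ascii_uppercase)  # placeholder entity names
    return [RELATION_LIBRARY[r].format(a=next(names), b=next(names)) for r in chosen]

print(sample_problem(2))  # e.g., ['line A is parallel to line B', ...]
```

Sampling more relations per problem yields harder instances, which is how difficulty levels can be controlled.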
We evaluate the generalization of the model trained on GeoReasoning against the base model on MMMU.
The model tuned on GeoReasoning outperforms the baseline in most domains, indicating both the generalization capacity of the model and the generalization property of the proposed dataset.
| Dataset | Type | MMMU |
|---|---|---|
| None | N/A | 43.3±0.7 |
| AutoGeo | Caption | 43.5±0.5 |
| GeoPeP | Caption | 43.7±0.6 |
| GeoGPT4V | QA | 44.0±0.9 |
| Geo170K | QA | 42.9±1.0 |
| GeoReasoning | Caption | 44.9±0.7 |
The model trained on GeoReasoning outperforms those trained on other datasets, revealing its stronger generalization property.
We evaluate the in-domain performance of the model trained on GeoReasoning against models trained on other commonly used datasets, using MathVista and MathVerse.
Accuracy on downstream benchmarks of models SFTed on datasets of various sizes.
The model trained on GeoReasoning generally improves with increasing dataset size on downstream benchmarks, and it exhibits much better scalability than the other baselines.
| Dataset | Type | MathVerse | MathVista |
|---|---|---|---|
| AutoGeo | Caption | 24.5±0.4 | 47.6±0.5 |
| GeoPeP | Caption | 24.2±0.2 | 47.7±0.4 |
| GeoGPT4V | QA | 25.2±0.5 | 47.5±0.2 |
| Geo170K | QA | 25.3±0.1 | 47.6±0.3 |
| GeoReasoning | Caption | 26.0±0.3 | 48.8±0.2 |
The model trained on GeoReasoning reasons better than models trained on other caption datasets and even outperforms models trained on multimodal reasoning datasets, demonstrating GeoReasoning's stronger ability to enhance a model's reasoning capabilities.
The geometry data synthesis pipeline, in which a graph-based representation similar to AutoGeo is employed to generate the final geometry images. The relation library comprises over 50 basic geometric relationships that can be composed into complex ones, providing comprehensive coverage of geometric problems of various difficulties. The image-caption pairs are used for the SFT stage, while the caption-QA pairs are used for the RLVR stage.
The workflow of the Geo-Image-Textualization method. In Stage 1, the model is trained to develop a preliminary ability to generate image captions. In Stage 2, an alternating optimization strategy is employed to jointly refine the generated captions and enhance the model's overall performance. The data for Stage 1 comes from the rule-based image-caption generation pipeline.
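The alternating optimization in Stage 2 follows the general RAFT (reward-ranked fine-tuning) recipe: sample several candidate captions per image, score them with a reward, keep the best, and fine-tune on the kept pairs. Below is a minimal runnable sketch of one plausible instantiation; `generate_captions`, `reward`, and `sft_update` are hypothetical stubs, not the paper's actual implementation.

```python
import random

# Hypothetical stand-ins for the real components; in practice these would be
# model sampling, reward scoring (see below), and supervised fine-tuning.
def generate_captions(model, image, num_samples):
    return [f"caption-{random.random():.3f}" for _ in range(num_samples)]

def reward(image, caption):
    return random.random()

def sft_update(model, pairs):
    return model  # placeholder: fine-tune on the kept (image, caption) pairs

def raft_round(model, images, k=8, keep_top=1):
    """One RAFT round: sample k candidate captions per image, keep the
    highest-reward candidates, then fine-tune on the filtered pairs."""
    kept = []
    for img in images:
        candidates = generate_captions(model, img, num_samples=k)
        best = sorted(candidates, key=lambda c: reward(img, c), reverse=True)[:keep_top]
        kept.extend((img, c) for c in best)
    return sft_update(model, kept)

# Stage 2 alternates reward-ranked filtering with supervised updates.
model, images = object(), ["image_1", "image_2"]
for _ in range(3):
    model = raft_round(model, images)
```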
The reward modeling includes two reward functions: a caption reward and a reasoning reward. The former is a weighted sum of ROUGE and BLEU scores. For the latter, we construct a series of prompts by concatenating each pre-generated question with its corresponding candidate caption, feed them into a language model to generate answers, and compare those answers against the ground truth to obtain a correctness reward.
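A minimal sketch of the two rewards, assuming the off-the-shelf `rouge-score` and `nltk` packages; the 0.5 mixing weight and the `answer_with_lm` hook are illustrative assumptions, not the paper's exact choices.

```python
from rouge_score import rouge_scorer  # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def caption_reward(candidate, reference, alpha=0.5):
    """Weighted sum of ROUGE-L F1 and BLEU; alpha is an assumed weight."""
    rouge = _scorer.score(reference, candidate)["rougeL"].fmeasure
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)
    return alpha * rouge + (1 - alpha) * bleu

def reasoning_reward(caption, questions, answers, answer_with_lm):
    """Fraction of pre-generated questions the language model answers
    correctly when prompted with the candidate caption; answer_with_lm is
    a hypothetical hook that queries the LM with the concatenated prompt."""
    correct = sum(
        answer_with_lm(f"{caption}\n\nQuestion: {q}").strip() == gt.strip()
        for q, gt in zip(questions, answers)
    )
    return correct / max(1, len(questions))
```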
Here are some samples on geometry, arithmetic, numeric, and scientific domains from MathVista.
These cases demonstrate that the geometry captioning task boosts the mathematical reasoning capacity of base models.
Here are some samples on engineering, physics, and economics domains from MMMU.
These cases indicate that the model trained on GeoReasoning observes shapes in more detail and with greater accuracy, and develops superior spatial reasoning ability.
@misc{georeasoning,
  title={Generalizable Geometric Image Caption Synthesis},
  author={Yue Xin and Wenyuan Wang and Rui Pan and Ruida Wang and Howard Meng and Shizhe Diao and Renjie Pi and Tong Zhang},
  year={2025},
  eprint={2509.15217},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.15217},
}