These geometry problems are composed from templates in our relation library, corresponding to easy, medium, and hard difficulty levels, respectively. For visual clarity, the colors, font sizes, and line thicknesses in this figure have been modified relative to the original images in our constructed dataset; please refer to the original dataset for precise details.
Multimodal large language models have many practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Moreover, most template-based data synthesis pipelines fail to generalize to questions outside their predefined templates. In this paper, we mitigate this issue by introducing a complementary RLHF process into the data generation pipeline. By adopting RAFT to adjust captions for image-text pairs generated from fewer than 50 templates, and by using reward signals derived from downstream mathematical problem-solving tasks, our pipeline captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. The generated dataset also enhances the general mathematical reasoning capabilities of multimodal large language models beyond geometric problems, yielding accuracy improvements of 2.8%–5.3% on arithmetic, algebraic, and numerical tasks, even with non-geometric input images.
We train base models on diverse mathematical datasets, covering both captioning (AutoGeo, GeoPeP, and our proposed GeoReasoning) and reasoning (GeoGPT4V, Geo170K) datasets, as well as on subsets of these datasets of various sizes. We then evaluate the trained models on the commonly used mathematical benchmarks MathVista and MathVerse.
Result 1: The model trained on GeoReasoning outperforms those trained on other datasets, indicating the high quality of GeoReasoning and its superior ability to improve reasoning capacity.
Result 2: Models trained on various sizes of GeoReasoning exhibit clear scaling behavior.
We train base models on GeoReasoning and evaluate them on MMMU to test generalization.
Result 1: The model trained on GeoReasoning outperforms the base model in domains such as art, science, and engineering, indicating that GeoReasoning can boost reasoning capacity across diverse domains.
Result 2: The model trained on GeoReasoning outperforms those trained on other datasets, revealing GeoReasoning's stronger generalization.
We first pre-define a relation library of approximately 50 basic relations, together with a corresponding clause library.
We then randomly sample relations and their corresponding clauses, and check their compatibility, as sketched below.
In this way we can generate geometry problems of arbitrary complexity and difficulty.
Since generation is fast, we can efficiently produce a large volume of samples at scale.
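To make the sampling loop concrete, here is a minimal Python sketch of the idea; the relation names, clause templates, and the `compatible` check are hypothetical toy stand-ins for the actual relation and clause libraries.

```python
import random
import string

# Hypothetical miniature relation library: each relation maps to a clause
# template (the real library contains roughly 50 such relations).
RELATION_LIBRARY = {
    "perpendicular": "line {a} is perpendicular to line {b}",
    "parallel": "line {a} is parallel to line {b}",
    "midpoint": "point {a} is the midpoint of segment {b}",
}

# Toy compatibility rule: the same pair of lines cannot be
# simultaneously parallel and perpendicular.
CONFLICTS = {frozenset({"parallel", "perpendicular"})}

def compatible(chosen, candidate):
    return all(frozenset({c, candidate}) not in CONFLICTS for c in chosen)

def sample_problem(num_relations):
    """Randomly sample mutually compatible relations and render their clauses."""
    chosen = []
    while len(chosen) < num_relations:
        candidate = random.choice(list(RELATION_LIBRARY))
        if candidate not in chosen and compatible(chosen, candidate):
            chosen.append(candidate)
    names = iter(string.ascii_uppercase)  # placeholder entity names
    return [RELATION_LIBRARY[r].format(a=next(names), b=next(names)) for r in chosen]

print(sample_problem(2))  # e.g., ['line A is parallel to line B', ...]
```

Sampling more relations per problem yields harder instances, which is how difficulty levels can be controlled.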
We evaluate the generalization of the model trained on GeoReasoning against the base model on MMMU.
The model tuned on GeoReasoning outperforms the baseline in most domains, indicating both the generalization capacity of the model and the generalization property of the proposed dataset.
| Dataset | Type | MMMU |
|---|---|---|
| None | N/A | 43.3±0.7 |
| AutoGeo | Caption | 43.5±0.5 |
| GeoPeP | Caption | 43.7±0.6 |
| GeoGPT4V | QA | 44.0±0.9 |
| Geo170K | QA | 42.9±1.0 |
| GeoReasoning | Caption | 44.9±0.7 |
The model trained on GeoReasoning outperforms those trained on other datasets, revealing its stronger generalization property.
We evaluate the in-domain performance of the model trained on GeoReasoning against models trained on other commonly used datasets, using MathVista and MathVerse.
Accuracy on downstream benchmarks of models SFTed on datasets of various sizes.
The model trained on GeoReasoning generally improves with increasing dataset size on downstream benchmarks, and it exhibits much better scalability than the other baselines.
| Dataset | Type | MathVerse | MathVista |
|---|---|---|---|
| AutoGeo | Caption | 24.5±0.4 | 47.6±0.5 |
| GeoPeP | Caption | 24.2±0.2 | 47.7±0.4 |
| GeoGPT4V | QA | 25.2±0.5 | 47.5±0.2 |
| Geo170K | QA | 25.3±0.1 | 47.6±0.3 |
| GeoReasoning | Caption | 26.0±0.3 | 48.8±0.2 |
The model trained on GeoReasoning reasons better than models trained on other caption datasets and even outperforms models trained on multimodal reasoning datasets, demonstrating GeoReasoning's stronger ability to enhance a model's reasoning capabilities.
The geometry data synthesis pipeline, in which a graph-based representation similar to AutoGeo is employed to generate the final geometry images. The relation library comprises over 50 basic geometric relationships that can be composed into complex ones, providing comprehensive coverage of geometric problems of various difficulties. The image-caption pairs are used for the SFT stage, while the caption-QA pairs are used for the RLVR stage.
The workflow of the Geo-Image-Textualization method. In Stage 1, the model is trained to develop a preliminary ability to generate image captions. In Stage 2, an alternating optimization strategy is employed to jointly refine the generated captions and enhance the model's overall performance. The data for Stage 1 comes from the rule-based image-caption generation pipeline.
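The alternating optimization in Stage 2 follows the general RAFT (reward-ranked fine-tuning) recipe: sample several candidate captions per image, score them with a reward, keep the best, and fine-tune on the kept pairs. Below is a minimal runnable sketch of one plausible instantiation; `generate_captions`, `reward`, and `sft_update` are hypothetical stubs, not the paper's actual implementation.

```python
import random

# Hypothetical stand-ins for the real components; in practice these would be
# model sampling, reward scoring (see below), and supervised fine-tuning.
def generate_captions(model, image, num_samples):
    return [f"caption-{random.random():.3f}" for _ in range(num_samples)]

def reward(image, caption):
    return random.random()

def sft_update(model, pairs):
    return model  # placeholder: fine-tune on the kept (image, caption) pairs

def raft_round(model, images, k=8, keep_top=1):
    """One RAFT round: sample k candidate captions per image, keep the
    highest-reward candidates, then fine-tune on the filtered pairs."""
    kept = []
    for img in images:
        candidates = generate_captions(model, img, num_samples=k)
        best = sorted(candidates, key=lambda c: reward(img, c), reverse=True)[:keep_top]
        kept.extend((img, c) for c in best)
    return sft_update(model, kept)

# Stage 2 alternates reward-ranked filtering with supervised updates.
model, images = object(), ["image_1", "image_2"]
for _ in range(3):
    model = raft_round(model, images)
```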
The reward modeling includes two reward functions: a caption reward and a reasoning reward. The former is a weighted sum of ROUGE and BLEU scores. For the latter, we construct a series of prompts by concatenating each pre-generated question with its corresponding candidate caption, feed them into a language model to generate answers, and compare those answers against the ground truth to obtain a correctness reward.
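A minimal sketch of the two rewards, assuming the off-the-shelf `rouge-score` and `nltk` packages; the 0.5 mixing weight and the `answer_with_lm` hook are illustrative assumptions, not the paper's exact choices.

```python
from rouge_score import rouge_scorer  # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def caption_reward(candidate, reference, alpha=0.5):
    """Weighted sum of ROUGE-L F1 and BLEU; alpha is an assumed weight."""
    rouge = _scorer.score(reference, candidate)["rougeL"].fmeasure
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)
    return alpha * rouge + (1 - alpha) * bleu

def reasoning_reward(caption, questions, answers, answer_with_lm):
    """Fraction of pre-generated questions the language model answers
    correctly when prompted with the candidate caption; answer_with_lm is
    a hypothetical hook that queries the LM with the concatenated prompt."""
    correct = sum(
        answer_with_lm(f"{caption}\n\nQuestion: {q}").strip() == gt.strip()
        for q, gt in zip(questions, answers)
    )
    return correct / max(1, len(questions))
```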
Here are some samples on geometry, arithmetic, numeric, and scientific domains from MathVista.
These cases demonstrate that the geometry captioning task boosts the mathematical reasoning capacity of base models.
Here are some samples on engineering, physics, and economics domains from MMMU.
These cases indicate that the model trained on GeoReasoning observes shapes in more detail and with greater accuracy, and develops superior spatial reasoning ability.
@misc{georeasoning,
  title={Generalizable Geometric Image Caption Synthesis},
  author={Yue Xin and Wenyuan Wang and Rui Pan and Ruida Wang and Howard Meng and Shizhe Diao and Renjie Pi and Tong Zhang},
  year={2025},
  eprint={2509.15217},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.15217},
}