GeoReasoning

Generalizable Geometric Image Caption Synthesis

Figure: an easy, a medium, and a hard example problem.

These geometry problems are composed from templates in our relation library, corresponding to easy, medium, and hard difficulty levels, respectively. For visual clarity, this figure has modified colors, font sizes, and line thicknesses compared to the original images in our constructed dataset; please refer to the original dataset for precise details.

Abstract

Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements in this area, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions outside their predefined templates. In this paper, we mitigate this issue by introducing a complementary RLHF process into the data generation pipeline. By adopting RAFT to adjust captions for image-text pairs generated from fewer than 50 templates and using reward signals derived from downstream mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. Furthermore, the generated dataset also enhances the general mathematical reasoning capabilities of multimodal large language models beyond the domain of geometric mathematical problems, yielding accuracy improvements of 2.8%–5.3% in arithmetic, algebraic, and numerical tasks with even non-geometric input images.

Key Achievements

High Quality
GeoReasoning is a carefully constructed geometry captioning dataset consisting of high-quality image-caption pairs. It outperforms comparable datasets on downstream benchmarks and exhibits favorable scaling behavior.

Experiments

We train base models on diverse mathematical datasets, covering both captioning (AutoGeo, GeoPeP, and our proposed GeoReasoning) and reasoning (GeoGPT4V, Geo170K) datasets. We also train base models on subsets of these datasets at various sizes. We then evaluate the trained models on commonly used mathematical benchmarks (MathVista and MathVerse).

Result 1: The model trained on GeoReasoning outperforms those trained on other datasets, indicating the high quality of GeoReasoning and its superior ability to improve reasoning capacity.

Result 2: The models trained on various sizes of GeoReasoning exhibit clear scaling behavior.

Strong Generalization
The improvements brought by GeoReasoning are not limited to geometric tasks; they also generalize to non-geometric mathematical tasks and even non-mathematical domains such as art and engineering.

Experiments

We train base models on GeoReasoning and evaluate the performance on MMMU to test its generalization property.

Result 1: The model trained on GeoReasoning outperforms the base model in various domains such as art, science, and engineering, indicating that GeoReasoning can boost reasoning capacity across diverse domains.

Result 2: The model trained on GeoReasoning outperforms those on other datasets, revealing the stronger generalization property of GeoReasoning.

Superior Scalability
The generated examples are composed from a pre-defined relation library with approximately 50 basic relations, allowing expansion to geometry problems of arbitrary complexity.

Generation Process

    We first pre-define a relation library composed of approximately 50 basic relations and the corresponding clause library.

    We then randomly sample relations and the corresponding clauses, after which we check their compatibility.

    In this way we can generate geometry problems of arbitrary complexity and difficulty.

    Since the generation is fast, we are able to efficiently and scalably generate a large volume of samples.
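The sampling loop above can be sketched as follows. This is a toy illustration, not the actual pipeline: the three-entry relation library, the clause templates, and the compatibility rule are all invented stand-ins for the real ~50-relation library and its geometric consistency checks.

```python
import random

# Toy relation library standing in for the real ~50-relation library;
# relation names and clause templates here are illustrative only.
RELATION_LIBRARY = {
    "midpoint": "point {a} is the midpoint of segment {b}{c}",
    "parallel": "line {a}{b} is parallel to line {c}{d}",
    "perpendicular": "line {a}{b} is perpendicular to line {c}{d}",
}

def compatible(relations):
    # Toy compatibility check: forbid mixing parallel and perpendicular
    # in one sample; the real pipeline verifies geometric consistency.
    return not ({"parallel", "perpendicular"} <= set(relations))

def sample_problem(num_relations, seed=0):
    """Sample relations, verify compatibility, and render their clauses."""
    rng = random.Random(seed)
    while True:
        relations = rng.sample(sorted(RELATION_LIBRARY), num_relations)
        if compatible(relations):
            break
    labels = iter("ABCDEFGHIJKL")
    clauses = []
    for r in relations:
        clauses.append(RELATION_LIBRARY[r].format(
            a=next(labels), b=next(labels), c=next(labels), d=next(labels)))
    return "; ".join(clauses)

print(sample_problem(2))
```

Because each sample only draws relations and instantiates text templates, generation cost stays low, which is what makes the large-scale synthesis described above feasible.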

Experimental Results

Out-of-Domain Generalization

We evaluate the generalization property of GeoReasoning compared to the base model on MMMU.

Figure: The accuracy of the base model and the model SFTed on GeoReasoning, evaluated on MMMU.

The model tuned on GeoReasoning outperforms the baseline in most domains, indicating both the generalization capacity of the model and the generalization property of the proposed dataset.

Table: Accuracy of the models trained on 10K samples of various datasets over multiple random seeds

Dataset       Type     MMMU
None          N/A      43.3±0.7
AutoGeo       Caption  43.5±0.5
GeoPeP        Caption  43.7±0.6
GeoGPT4V      QA       44.0±0.9
Geo170K       QA       42.9±1.0
GeoReasoning  Caption  44.9±0.7

The model trained on GeoReasoning outperforms those on other datasets, revealing its stronger generalization property.

In-Domain Improvements

We evaluate the in-domain performance of GeoReasoning compared to other commonly-used datasets on MathVista and MathVerse.

Figure: The accuracy of models SFTed on various dataset sizes and datasets, evaluated on the downstream benchmarks MathVista and MathVerse.

In general, the model trained on GeoReasoning improves progressively as the dataset size increases on downstream benchmarks. It also exhibits markedly better scalability than the other baselines.

Table: Accuracy of the models trained on 10K samples of various datasets over multiple random seeds

Dataset       Type     MathVerse  MathVista
AutoGeo       Caption  24.5±0.4   47.6±0.5
GeoPeP        Caption  24.2±0.2   47.7±0.4
GeoGPT4V      QA       25.2±0.5   47.5±0.2
Geo170K       QA       25.3±0.1   47.6±0.3
GeoReasoning  Caption  26.0±0.3   48.8±0.2

The model trained on GeoReasoning has better reasoning ability compared to that trained on other caption datasets and even outperforms other multimodal reasoning datasets. This observation demonstrates a more pronounced ability of GeoReasoning to enhance the model's reasoning capabilities.

Methodology

Data Generation Pipeline


The geometry data synthesis pipeline, where a graph-based representation similar to AutoGeo is employed to generate the final geometry images. The relation library comprises over 50 basic geometric relationships that can be composed into complex ones, providing comprehensive coverage of geometry problems of various difficulties. The image-caption pair is used for the SFT stage, while the caption-QA pair is used for the RLVR stage.

Training Pipeline


The workflow of the Geo-Image-Textualization method. In Stage 1, the model is trained to develop a preliminary ability to generate image captions. In Stage 2, an alternating optimization strategy jointly refines the generated captions and enhances the model's overall performance. The data for Stage 1 comes from the rule-based image-caption generation pipeline.
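A RAFT-style iteration like the one used in Stage 2 can be sketched as below. Everything here is a hypothetical stand-in: `ToyCaptioner`, the length-based `reward`, and the `sft_update` recording step merely illustrate the sample-rank-filter-finetune loop, not the actual training stack.

```python
import random

def reward(image, caption):
    # Hypothetical reward: prefer longer captions as a stand-in for the
    # paper's combined caption and reasoning rewards.
    return len(caption)

class ToyCaptioner:
    """Stand-in for the multimodal model; samples noisy captions."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.dataset = []

    def generate(self, image):
        words = self.rng.randint(3, 10)
        return f"caption of {image} with {words} details"

    def sft_update(self, pairs):
        # Record the reward-filtered pairs that SFT would train on.
        self.dataset.extend(pairs)

def raft_round(model, images, num_candidates=8, keep_top=1):
    """One RAFT iteration: sample candidates, rank by reward,
    keep the best per image, then fine-tune on the kept pairs."""
    selected = []
    for image in images:
        candidates = [model.generate(image) for _ in range(num_candidates)]
        best = sorted(candidates, key=lambda c: reward(image, c), reverse=True)
        selected.extend((image, c) for c in best[:keep_top])
    model.sft_update(selected)
    return model

model = raft_round(ToyCaptioner(), ["img_1", "img_2"])
print(len(model.dataset))  # 2 kept pairs, one per image
```

Alternating this round with reward refinement gives the Stage 2 loop described above: each pass improves the captions the model is trained on, which in turn improves the next round's candidates.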

Reward Modeling


The reward modeling includes two reward functions: the caption reward and the reasoning reward. The former is a weighted sum of ROUGE and BLEU scores. For the latter, we construct a series of prompts by concatenating each pre-generated question with its corresponding candidate caption, feed them into a language model to generate answers, and compare the answers against the ground truth to obtain the correctness reward.
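The two rewards can be sketched as follows. This is a simplified illustration under stated assumptions: a unigram-overlap F1 stands in for both ROUGE and BLEU, the 0.5/0.5 weights are assumed, and `answer_model` is any callable mapping a prompt string to an answer string.

```python
def unigram_f1(candidate, reference):
    """Cheap stand-in for ROUGE/BLEU: unigram-overlap F1."""
    c, r = set(candidate.split()), set(reference.split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def caption_reward(candidate, reference, w_rouge=0.5, w_bleu=0.5):
    # The paper's caption reward is a weighted sum of ROUGE and BLEU;
    # both metrics and the weights are replaced by assumptions here.
    score = unigram_f1(candidate, reference)
    return w_rouge * score + w_bleu * score

def reasoning_reward(answer_model, question, caption, ground_truth):
    """Prompt = question + candidate caption; reward 1.0 iff the
    model's answer matches the ground truth (the correctness reward)."""
    prompt = f"{question}\n{caption}"
    return 1.0 if answer_model(prompt) == ground_truth else 0.0

# Usage with a trivial stand-in language model that always answers "5".
toy_lm = lambda prompt: "5"
print(caption_reward("AB is parallel to CD", "AB is parallel to CD"))  # 1.0
print(reasoning_reward(toy_lm, "What is AB?", "AB = 5", "5"))          # 1.0
```

The key design point is that the reasoning reward is grounded in downstream problem solving: a caption only scores well if it carries enough geometric detail for a language model to answer the paired question correctly.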

Case Study

MathVista

Here are some samples on geometry, arithmetic, numeric, and scientific domains from MathVista.

These cases demonstrate that the geometry captioning task boosts the mathematical reasoning capacity of base models.

MMMU

Here are some samples on engineering, physics, and economics domains from MMMU.

These cases indicate that the model trained on GeoReasoning observes shapes in more detail and with greater accuracy, and develops superior spatial reasoning ability.

BibTeX


        @misc{georeasoning,
            title={Generalizable Geometric Image Caption Synthesis}, 
            author={Yue Xin and Wenyuan Wang and Rui Pan and Ruida Wang and Howard Meng and Shizhe Diao and Renjie Pi and Tong Zhang},
            year={2025},
            eprint={2509.15217},
            archivePrefix={arXiv},
            primaryClass={cs.AI},
            url={https://arxiv.org/abs/2509.15217}, 
        }