VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

¹University of Washington · ²Allen Institute for AI
*Equal contribution · Project lead

Overview of VFIG. Given complex raster images (top row) as input, VFIG generates editable, high-fidelity SVG code (pink box). Rendering the generated SVG (bottom row) produces outputs nearly indistinguishable from the inputs.

Abstract

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only “flat” rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision–Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-Data, a large-scale dataset of 66K high-quality figure–SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-Bench, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-Bench.

VFIG Dataset

We curate VFIG-Data, a large-scale dataset of 66K rigorously filtered image–SVG pairs. Unlike prior SVG datasets that focus predominantly on icons or decorative graphics, VFIG-Data targets diagram-centric scientific figures. To our knowledge, it is the first dataset of this scale purpose-built for structured scientific figure generation.

Dataset Examples

Examples of VFIG-Data and academic data. We show the three sources: simple diagrams from academic datasets, complex diagram layouts, and a curated set of basic shapes and arrows to support structured SVG generation.

Generation Pipeline

Our pipeline covers three complementary data sources: (1) a data generation pipeline that produces complex diagrams via a VLM-based describe-and-generate approach from crawled images; (2) a rigorous filtering procedure to ensure quality and fidelity; and (3) 78K additional data points from existing academic SVG datasets, processed through the same filtering pipeline to improve generalization. Together, these form a diverse and high-quality training mixture.
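The filtering stage can be sketched as a simple validity gate. The checks and thresholds below (XML well-formedness, an `<svg>` root, an element-count range) are illustrative assumptions, not the paper's exact criteria; the real pipeline also compares the rendered SVG against the source raster.

```python
import xml.etree.ElementTree as ET

def passes_filter(svg_text, min_elements=3, max_elements=2000):
    """Illustrative quality gate for candidate figure-SVG pairs.

    The thresholds are hypothetical stand-ins for VFIG-Data's
    filtering rules; they only catch unrenderable or trivial SVGs.
    """
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return False  # malformed XML cannot be rendered
    if not root.tag.endswith("svg"):
        return False  # root element must be <svg>
    # count all elements (rough proxy for diagram complexity)
    n = sum(1 for _ in root.iter())
    return min_elements <= n <= max_elements

good = '<svg xmlns="http://www.w3.org/2000/svg"><rect/><circle/><path/></svg>'
bad = "<svg><rect></svg>"  # unbalanced tag: fails to parse
print(passes_filter(good), passes_filter(bad))  # → True False
```

A gate like this runs cheaply over millions of candidates before the more expensive render-and-compare checks.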

Data generation and filtering pipelines. We show the data generation and filtering processes for curated academic figures, complex diagrams created through a VLM-based describe-and-generate pipeline from crawled images, and shapes and arrows produced by LLM-generated templates with randomized elements.

Training Mixture


Summary of the training mixture.

VFIG Model

Input Figure → SFT Training → Reinforcement Learning → Generated SVG

  1. SFT Training: learns SVG syntax and structure from paired figure–SVG supervision, following a simple-to-complex curriculum.
     L_SFT = āˆ’š”¼[log p(y | x)]
  2. Reinforcement Learning: improves visual fidelity using rewards computed on rendered SVG outputs, averaging four structural sub-rewards (presence, layout, connectivity, details).
     R = (Pres + Lay + Conn + Det) / 4
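The two training objectives can be sketched numerically. The token probabilities and sub-reward values below are made up for illustration; in training, p(y|x) comes from the model's softmax over SVG tokens and the sub-rewards from judging the rendered output.

```python
import math

def sft_loss(token_probs):
    """L_SFT = -E[log p(y|x)]: mean negative log-likelihood of the
    target SVG tokens (the probabilities here are illustrative)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def structural_reward(presence, layout, connectivity, details):
    """R = (Pres + Lay + Conn + Det) / 4, each sub-reward in [0, 1]."""
    return (presence + layout + connectivity + details) / 4

print(round(sft_loss([0.9, 0.8, 0.95]), 4))        # → 0.1266 (lower is better)
print(round(structural_reward(0.8, 0.9, 0.7, 1.0), 2))  # → 0.85
```

Averaging the four sub-rewards keeps each structural aspect on equal footing, so the policy cannot trade connectivity away for pixel-level detail.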

Experimental Results

With VFIG, we benchmark figure-to-SVG generation and highlight four key takeaways:

  1. Current VLMs struggle with faithful SVG generation. Classical raster-to-vector methods achieve high pixel similarity but produce noisy primitives, while open-source VLM baselines underperform on both visual fidelity and structural correctness.
  2. Coarse-to-fine curriculum SFT improves compositional stability. Two-stage training produces more consistent structures and higher judge scores across different backbone models.
  3. RL with visual feedback further boosts quality. Reinforcement learning consistently outperforms SFT-only across all evaluation metrics.
  4. Structure-aware rewards outperform pixel-level objectives. Structural rewards improve judge-based scores, whereas pixel-level losses may increase SSIM but degrade structural quality.

Full Comparison Across All Models and Metrics

VFIG-Bench

Model              SSIM↑   LPIPS↓  VisualSim↑  VLM-Judge↑  Clean↑  Render↑
Classical raster-to-vector
VTracer            0.950   0.092   0.938       0.838       0.000   0.997
Closed-source VLMs
GPT-5.2            0.727   0.364   0.957       0.858       0.731   0.995
Gemini-2.5-flash   0.772   0.258   0.964       0.913       0.788   0.990
Gemini-2.5-pro     0.756   0.303   0.964       0.932       0.787   0.902
Open-source VLMs
OmniSVG-4B         0.695   0.601   0.505       0.039       0.000   0.819
StarVector-8B      0.611   0.416   0.839       0.582       0.643   0.139
Qwen2.5-VL-4B      0.708   0.574   0.857       0.466       0.794   0.476
Ours
Ours (SFT)         0.763   0.264   0.951       0.781       0.784   0.884
Ours (SFT+RL)      0.778   0.212   0.957       0.829       0.853   0.960

Molmo2-Diagram

Model              SSIM↑   LPIPS↓  VisualSim↑  VLM-Judge↑  Clean↑  Render↑
VTracer            0.942   0.113   0.886       0.757       0.000   1.000
GPT-5.2            0.763   0.283   0.955       0.894       0.792   1.000
Gemini-2.5-flash   0.828   0.162   0.965       0.936       0.833   0.992
Gemini-2.5-pro     0.784   0.244   0.959       0.929       0.814   0.930
OmniSVG-4B         0.705   0.545   0.504       0.096       0.000   0.894
StarVector-8B      0.496   0.467   0.809       0.591       0.728   0.146
Qwen2.5-VL-4B      0.722   0.512   0.859       0.540       0.774   0.629
Ours (SFT)         0.783   0.226   0.937       0.776       0.828   0.966
Ours (SFT+RL)      0.800   0.177   0.949       0.834       0.855   0.976

SVG-Diagrams

Model              SSIM↑   LPIPS↓  VisualSim↑  VLM-Judge↑  Clean↑  Render↑
VTracer            0.885   0.130   0.903       0.806       0.000   1.000
GPT-5.2            0.606   0.349   0.936       0.781       0.688   0.984
Gemini-2.5-flash   0.672   0.245   0.950       0.893       0.680   0.991
Gemini-2.5-pro     0.637   0.311   0.943       0.887       0.669   0.945
OmniSVG-4B         0.586   0.573   0.569       0.089       0.000   0.875
StarVector-8B      0.663   0.297   0.899       0.659       0.467   0.559
Qwen2.5-VL-4B      0.614   0.554   0.805       0.449       0.591   0.495
Ours (SFT)         0.633   0.311   0.907       0.653       0.710   0.939
Ours (SFT+RL)      0.654   0.267   0.919       0.705       0.788   0.973

VisualSim: avg. cosine similarity of DINO, CLIP, and SigLIP embeddings.  |  VLM-Judge: mean score of Gemini and GPT judges.  |  Clean: SVG cleanliness.  |  Render: successful rendering rate.
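The VisualSim metric averages cosine similarity across three embedding backbones. A minimal sketch with plain Python vectors, where the short lists are dummy stand-ins for the DINO, CLIP, and SigLIP features of the input and rendered images:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def visual_sim(input_embs, render_embs):
    """Mean cosine similarity over the three backbone embeddings
    (DINO, CLIP, SigLIP); the vectors here are placeholders."""
    pairs = list(zip(input_embs, render_embs))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# identical embeddings for input and rendering → similarity 1.0
embs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
print(round(visual_sim(embs, embs), 6))  # → 1.0
```

Averaging over three backbones makes the score less sensitive to any single encoder's biases than a single-model similarity would be.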

Comparison with SOTA Methods

SVG generation examples on VFIG-Bench. Given the same input raster image, we compare the rendered SVG outputs produced by different methods. Our model more faithfully preserves the structure of the input diagram. P/L/C/D denote the Gemini judge scores for presence, layout, connectivity, and details.

Additional SVG generation examples on VFIG-Bench.

Failure cases of VFIG on VFIG-Bench.

BibTeX


@misc{he2026vfigvectorizingcomplexfigures,
  title={VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models},
  author={Qijia He and Xunmei Liu and Hammaad Memon and Ziang Li and Zixian Ma and Jaemin Cho and Jason Ren and Daniel S Weld and Ranjay Krishna},
  year={2026},
  eprint={2603.24575},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.24575},
}