Cross-View Meets Diffusion: Aerial Image Synthesis With Geometry And Text Guidance

1Vermont Artificial Intelligence Lab, Department of Computer Science, University of Vermont
2Intelligent Machines Lab, Information Technology University
3Center for Research in Computer Vision, University of Central Florida

WACV 2025

Equal contribution, *Corresponding and senior author

Abstract

Frequent high-quality aerial images are not always accessible due to the high effort and cost required to capture them. One solution is to use Ground-to-Aerial (G2A) image synthesis to generate aerial images from easily collectible ground images. G2A is rarely studied due to challenges such as drastic view changes, occlusion, and the limited range of visibility. This paper presents a novel Geometric Preserving Ground-to-Aerial Image Synthesis (GPG2A) model that can generate realistic aerial images from ground images. GPG2A consists of two stages: the first stage predicts the Bird’s Eye View (BEV) segmentation (referred to as the BEV layout map) from the ground image, and the second stage synthesizes the aerial image from the predicted BEV layout map and text descriptions of the ground image. To train our model, we present a new multi-modal cross-view dataset, namely VIGORv2, built upon VIGOR. VIGORv2 introduces newly collected aerial images, layout maps, and text descriptions. Our experiments illustrate that GPG2A synthesizes better geometry-preserved aerial images than existing models. We also present two applications, data augmentation for cross-view geo-localization and sketch-based region search, to further verify the effectiveness of GPG2A.

Motivations 💡

⭐ Aerial images offer high-resolution, detailed views that are valuable across various applications, unlike lower-resolution satellite images that are often obscured by clouds

⭐ Aerial image collection is limited by the high effort and cost it requires, as images are typically captured by Unmanned Aerial Vehicles (UAVs) or drones

⭐ Ground images are far more available and cost-effective, thanks to modern and autonomous vehicles, as well as crowdsourcing platforms that receive massive daily uploads of street-view images

⭐ Thus, a promising cost-effective solution for aerial image collection is ground-to-aerial (G2A) image synthesis, which aims to generate more frequent aerial images from their corresponding ground views

Model Overview

GPG2A features a two-stage process: The first stage transforms the input ground image into a Bird’s Eye View (BEV) layout map estimate. The second stage leverages a pre-trained diffusion model (ControlNet), conditioned on the predicted BEV layout map from the first stage, to generate photo-realistic aerial images

[Figure: GPG2A model architecture]
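For intuition, here is a minimal sketch of Stage I. The encoder, decoder, and class set below are illustrative placeholders rather than the paper's exact architecture; only the overall mapping from a ground panorama to a BEV layout map follows the description above.

```python
# Minimal sketch of Stage I: ground image -> BEV layout map.
# The modules here are illustrative stand-ins, NOT the architecture from the paper.
import torch
import torch.nn as nn

class BEVLayoutPredictor(nn.Module):
    """Hypothetical Stage-I module: ground panorama -> BEV layout logits."""
    def __init__(self, num_classes: int = 3, bev_size: int = 256):
        super().__init__()
        self.bev_size = bev_size
        # Illustrative image encoder (any backbone could be plugged in here).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Illustrative head that predicts per-cell layout classes
        # (e.g., road / building / vegetation) on the top-down grid.
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, ground_img: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(ground_img)                      # (B, C, h, w)
        feats = nn.functional.interpolate(
            feats, size=(self.bev_size, self.bev_size),
            mode="bilinear", align_corners=False)             # lift to the BEV grid
        return self.decoder(feats)                            # (B, num_classes, H, W)

# Usage: a 512x1024 street-view panorama in, a 256x256 layout map out.
layout_logits = BEVLayoutPredictor()(torch.randn(1, 3, 512, 1024))
bev_layout = layout_logits.argmax(dim=1)  # class index per BEV cell
```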

Why Two stages? 🤔

⭐ The problem is simplified, reducing the domain gap between aerial and ground views

⭐ The BEV layout map explicitly preserves geometry correspondence between the views

⭐ Stage II can leverage strong pre-trained diffusion foundation models

Why add text? 🤔

⭐ To further improve synthesis quality and fuse surrounding information that is not fully represented in the BEV layout map, such as block types (e.g., commercial or residential); see the conditioning sketch below
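The sketch below shows how a BEV layout map and a text prompt can jointly condition generation with the public diffusers ControlNet API. The checkpoints, file names, and prompt are generic placeholders, not the GPG2A weights; the paper fine-tunes its own diffusion model on VIGORv2.

```python
# Sketch of Stage II: a ControlNet-style diffusion model conditioned on the
# predicted BEV layout map (as the control image) and a text description.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Generic public checkpoints used only for illustration.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

# Color-coded BEV layout map from Stage I, rendered as an RGB image (assumed path).
bev_layout = Image.open("bev_layout.png").convert("RGB")

# Text prompt carrying context the layout map cannot encode (e.g., block type).
prompt = "aerial view of a residential block with dense tree cover"

aerial = pipe(prompt, image=bev_layout, num_inference_steps=30).images[0]
aerial.save("synthesized_aerial.png")
```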

VIGORv2 🗺️

VIGORv2 includes center-aligned aerial-ground image pairs, layout maps, and text descriptions of ground images. It covers four major US cities and uses a geographical train-test split, as shown below

[Figure: VIGORv2 geographic train-test split]
[Figure: VIGORv2 samples (New York)]
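For illustration, here is a hypothetical PyTorch loader for VIGORv2-style samples. The directory layout and file names (ground/, aerial/, layout/, captions.json) are assumptions, not the official release format; only the four modalities per sample follow the dataset description above.

```python
# Hypothetical loader for a VIGORv2-style dataset (assumed directory layout):
# each sample pairs a ground panorama, a center-aligned aerial image,
# a BEV layout map, and a text description.
import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class VigorV2Pairs(Dataset):
    def __init__(self, root: str, city: str):
        self.root = Path(root) / city
        # Assumed: one JSON file per city mapping sample ids to text descriptions.
        self.captions = json.loads((self.root / "captions.json").read_text())
        self.ids = sorted(self.captions.keys())

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, i: int):
        sid = self.ids[i]
        return {
            "ground": Image.open(self.root / "ground" / f"{sid}.jpg"),
            "aerial": Image.open(self.root / "aerial" / f"{sid}.jpg"),
            "layout": Image.open(self.root / "layout" / f"{sid}.png"),
            "text": self.captions[sid],
        }

# Usage: train on some cities, test on held-out ones (geographical split).
train_set = VigorV2Pairs("VIGORv2", city="NewYork")
```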

Synthesis Results 🏞️

We benchmark our model on VIGORv2 using the same-area and cross-area protocols

[Figure: Same-area sample results]
[Figure: Cross-area sample results]

Application: Sketch-based Region Search 🕵️‍♂️

Ever struggled to locate that perfect spot you can picture in your mind but just can’t find? Using GPG2A, you can retrieve areas-of-interest from a rough hand-drawn sketch (mind-map) and a simple text description (prompt)

✏️ Sketch It: Draw what you have in mind

🗣️ Describe It: Add a few words about the area

🔎 Discover It: GPG2A synthesizes a fake aerial image, and we find the closest match from a database of real aerial images (see the retrieval sketch after this list)
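The retrieval step could look like the following sketch: embed the synthesized aerial image and a database of real aerial tiles with a generic pretrained encoder, then rank by cosine similarity. The ResNet-18 features and the file paths are placeholders, not the paper's retrieval model.

```python
# Sketch of sketch-based region search retrieval: nearest neighbors of the
# GPG2A-synthesized aerial image in a database of real aerial tiles.
import torch
import torchvision.models as models
from PIL import Image

# Generic pretrained encoder (placeholder for the actual retrieval features).
weights = models.ResNet18_Weights.DEFAULT
encoder = torch.nn.Sequential(
    *list(models.resnet18(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized feature vector for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(encoder(img).flatten(1), dim=1)

# Query: the fake aerial image synthesized from the sketch + prompt.
query = embed("synthesized_aerial.png")

# Database: real aerial tiles covering the search region (assumed paths).
db_paths = ["tiles/tile_000.jpg", "tiles/tile_001.jpg", "tiles/tile_002.jpg"]
db = torch.cat([embed(p) for p in db_paths], dim=0)

# Rank tiles by cosine similarity and report the best matches.
scores = (db @ query.T).squeeze(1)
for rank, idx in enumerate(scores.argsort(descending=True)[:3]):
    print(f"#{rank + 1}: {db_paths[idx]} (similarity {scores[idx]:.3f})")
```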

[Figure: Sketch-based region search sample results]

NOTE: You can use the Hugging Face demo linked in the title to generate fake aerial images from sketches