SurGe: Improved Surface Geometry in Point Maps

Preprint
¹RWTH Aachen University, ²Eindhoven University of Technology

Abstract

Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, SurGe achieves the best average rank for global point map \(\text{AbsRel}\) and consistently improves local point map and point map normal evaluations.
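To make the two geometric ideas in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the forward-difference stencil, the choice to normalize by ground-truth depth, and the L1 penalty are all hypothetical choices made for this example. Normals are taken as the cross product of the 3D differences to the right and lower neighbors, so their sign depends on the coordinate convention.

```python
import numpy as np

def point_map_normals(points):
    """Estimate unit normals from an (H, W, 3) point map via forward differences.

    Returns an (H-1, W-1, 3) array; the last row and column are dropped
    because they have no forward neighbor.
    """
    dx = points[:-1, 1:] - points[:-1, :-1]   # 3D step to the right neighbor
    dy = points[1:, :-1] - points[:-1, :-1]   # 3D step to the lower neighbor
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-12, None)

def mean_angular_error_deg(n_pred, n_gt):
    """Mean angle in degrees between two fields of unit normals."""
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def point_gradient_loss(pred, gt):
    """L1 distance between depth-normalized 3D finite differences.

    Assumption: both gradients are divided by the ground-truth depth (z).
    """
    z = np.clip(gt[..., 2], 1e-6, None)

    def grads(p):
        gx = (p[:, 1:] - p[:, :-1]) / z[:, :-1, None]   # horizontal differences
        gy = (p[1:, :] - p[:-1, :]) / z[:-1, :, None]   # vertical differences
        return gx, gy

    pgx, pgy = grads(pred)
    ggx, ggy = grads(gt)
    return float(np.abs(pgx - ggx).mean() + np.abs(pgy - ggy).mean())
```

Because only differences between neighboring points enter the loss, a constant 3D offset added to the prediction leaves it unchanged; in this sketch the term responds only to local surface shape, which is the kind of error the abstract describes as weakly reflected in global metrics.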

Qualitative Results

[Figure: point map examples from SurGe, MoGe-2, and InfiniDepth]

Point map comparison. SurGe predicts cleaner local geometry and preserves thin structures more faithfully than recent monocular point map models.

[Figure: point map normal examples from SurGe, MoGe-2, and InfiniDepth]

Point map normal comparison. The normal maps show fewer ripples and local surface distortions for SurGe.

SurGe Architecture

SurGe architecture overview. SurGe combines a DINOv2 encoder with our Neighborhood Attention Decoder (NAD), which progressively upsamples features with local Neighborhood Attention blocks.
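The local feature mixing step can be illustrated with a minimal single-head neighborhood attention sketch in NumPy. This is a simplified stand-in, not the NAD itself: it assumes identity query, key, and value projections, omits relative position biases and the progressive upsampling, and the function name and `kernel_size` parameter are hypothetical. Each pixel attends only to a `kernel_size` x `kernel_size` window, which is shifted inward at the borders so that every pixel sees the same number of neighbors.

```python
import numpy as np

def neighborhood_attention(x, kernel_size=3):
    """Single-head neighborhood attention over an (H, W, C) feature map.

    Simplification (assumption): queries, keys, and values are the raw
    features, with no learned projections or position bias.
    """
    x = np.asarray(x, dtype=float)
    h, w, c = x.shape
    k = kernel_size
    assert k <= h and k <= w, "window must fit inside the feature map"
    out = np.empty_like(x)
    for i in range(h):
        # Clamp the window start so border pixels still get k x k neighbors.
        i0 = min(max(i - k // 2, 0), h - k)
        for j in range(w):
            j0 = min(max(j - k // 2, 0), w - k)
            neigh = x[i0:i0 + k, j0:j0 + k].reshape(-1, c)   # (k*k, C)
            logits = neigh @ x[i, j] / np.sqrt(c)            # scaled dot products
            weights = np.exp(logits - logits.max())          # stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ neigh                      # weighted neighbor mix
    return out
```

The explicit loops keep the windowing logic readable; a practical implementation would rely on a fused kernel rather than per-pixel Python iteration.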

Decoder Comparison

[Figure: qualitative decoder results for NAD (SurGe) and ConvStack-L]

Qualitative decoder ablation. We compare our Neighborhood Attention Decoder (NAD) with other common decoder heads trained in the same setup. Parentheses indicate the model that uses the corresponding head style; ConvStack-L is a ConvStack that we scaled up to match the depth and width of our NAD.

BibTeX

@article{knaebel2026surge,
    title     = {{SurGe}: Improved Surface Geometry in Point Maps},
    author    = {Knaebel, Karim and Martin Garcia, Gonzalo and Schmidt, Christian and Fradlin, Ilya and Nunes, Lucas and de Geus, Daan and Leibe, Bastian},
    year      = 2026,
    journal   = {arXiv preprint arXiv:2605.31577},
}