D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

Abstract

Navigation Foundation Models (NFMs) trained on large, cross-embodied datasets have demonstrated powerful generalizability across various scenarios. Adopting in-domain fine-tuning upon an NFM efficiently calibrates the visuomotor policy, promising further improvement even in novel scenarios. However, fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach provided goals. Furthermore, such model updates on a small subset of data typically erode the pretrained prior, compromising pretraining generalization.

In this work, we present D-CLING (Depth-Conditioned, ControlNet-driven Learning for General Navigation Models), a novel fine-tuning method that leverages large-scale pretraining while efficiently learning in novel setups. Inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pretrained backbone using zero-initialized residual pathways, thereby learning geometric cues from dense depth maps. This design enables the model to efficiently acquire in-domain geometry while preserving pretrained knowledge across various behaviors.

shield

Prior-Preserving Fine-Tuning

Retains pretrained policy priors while explicitly injecting geometry-awareness through depth conditioning.

precision_manufacturing

Comprehensive Validation

Superior goal reachability and obstacle avoidance in both real-robot deployments and offline evaluations.

rocket_launch

Beyond Fine-Tuned Domain

The fine-tuned model extends navigation capability beyond the fine-tuning domain, enabling continuous learning.

Motivation

Why do Navigation Foundation Models need depth-conditioned fine-tuning?

Domain shift in geometric perception. Zero-shot NoMaD fails in real-robot navigation due to differences in camera configuration (e.g., fisheye vs. pinhole), leading to unsafe clearance and distance misestimation.

Trajectory diversity collapse after fine-tuning

Catastrophic forgetting after fine-tuning. Left: Zero-shot NoMaD generates diverse trajectories. Right: After fine-tuning, trajectory diversity collapses drastically, indicating loss of pretrained priors.

            Two Key Challenges
            Lack of geometry awareness — NFMs struggle with domain shift from camera configuration differences (field of view, distortion), leading to navigation failures in novel environments.
Catastrophic forgetting — Standard fine-tuning erodes the pretrained behavioral diversity, which is crucial for handling various navigational scenarios like junction selection and obstacle avoidance.

        

Method

ControlNet-inspired prior-preserving architecture: frozen RGB policy + trainable depth branch

Overview of D-CLING. A pretrained NoMaD RGB branch remains frozen, while a trainable depth branch learns geometric corrections. Short RGB history and goal image are encoded as in NoMaD, and depth-conditioned features are injected into intermediate layers through zero-initialized residual paths.

Model Architecture. D-CLING copies the pretrained backbone to form a Depth Branch. RGB-D input is first projected by a 4→3 embedding layer, then passed through a U-Net diffusion backbone. At each corresponding layer, depth features are fused into the frozen RGB branch with zero-initialized 1×1 convolutions.

            Method Overview
            Prior-preserving design: The pretrained RGB branch is frozen to keep NoMaD's original navigation behaviors.
Depth-conditioned adaptation: A trainable depth branch injects geometric cues through zero-initialized residual fusion, so adaptation starts safely and grows progressively.
Same policy objective, better geometry: D-CLING keeps the original action-prediction objective while improving obstacle-distance and clearance awareness under camera/domain shift.

        

Real-World Experiments

Evaluated on Toyota HSR robot across three challenging scenarios

(i) Basic Obstacle

Corridor traversal with visual avoidance of a stationary box.

(ii) Dynamic Corridor

After 10m travel, avoid an unmapped chair at corridor center.

(iii) Long-range

~50m trajectory across two junctions with scene dynamics.

Failure case of baselines. Baseline policies can fail to avoid obstacles robustly under camera/domain shifts.

Real-World Navigation Performance

Success rate (SR) for 10 trials each in (i)-(ii), average interventions of 5 trials for (iii)

Method	Training	Modality	(i) Basic Obstacle SR (%) ↑	(ii) Dynamic Corridor SR (%) ↑	(iii) Long-range Interventions ↓
NoMaD	Frozen	RGB	50	0	2.6
NoMaD-FT	Full fine-tune	RGB	30	10	3.2
NoMaD-EF	Early fusion	RGB-D	40	0	4.4
D-CLING (Ours)	Zero-init	RGB-D	70	60	1.2

Offline Evaluation

Action prediction accuracy across fine-tuning and pretrained domains

Benchmark on Offline Data

ADE / FDE / DTW — lower is better. Bold = best, underline = runner-up.

Method	F.T. Dataset RealHSRNav			Recon			GoStanford			Sacson			Scand
Method	ADE	FDE	DTW	ADE	FDE	DTW	ADE	FDE	DTW	ADE	FDE	DTW	ADE	FDE	DTW
NoMaD	1.326	2.160	0.917	1.691	2.996	1.301	2.267	4.448	1.888	2.508	4.285	2.003	2.035	2.990	1.283
NoMaD-FT	2.138	5.484	2.255	1.810	4.775	1.956	2.097	5.436	2.216	1.861	4.435	1.913	1.622	4.115	1.659
NoMaD-EF	1.897	4.244	1.674	2.222	4.676	1.991	2.540	6.246	2.592	2.737	5.497	2.455	2.428	5.063	2.045
D-CLING (Ours)	1.298	1.443	0.726	1.502	3.037	1.312	1.812	4.275	1.739	2.521	4.385	1.929	1.839	2.401	1.065

(a) RealHSRNav — Predicted waypoints closely match ground truth.

(b) GoStanford — D-CLING maintains accuracy even on the pretrained domain.

Key Finding: Beyond-Domain Generalization

D-CLING fine-tuned only on RealHSRNav achieves competitive or even better performance on the NoMaD pretraining datasets compared to zero-shot NoMaD. This demonstrates that our approach not only adapts to the target domain but also extends the pretrained model's capability, supporting continuous learning for general navigation.

Ablation: RGB vs. RGB-D Conditioning

Validating the benefit of depth conditioning in the ControlNet architecture

Single Obstacle

The robot traverses a corridor while avoiding a single stationary chair placed along its route. This scenario is similar to the real-world Dynamic Corridor and is represented in the training dataset.

Multi Obstacle Out-of-Distribution

The robot navigates through a 15×5 m² area with three obstacles placed at uniform intervals in an alternating left-right arrangement. This configuration is NOT in the training dataset.

RGB vs. RGB-D ControlNet Fine-Tuning

Success rate on obstacle avoidance scenarios (higher is better)

Method	Modality	(i) Single SR (%) ↑	(ii) Multi (OOD) SR (%) ↑
NoMaD-FT	RGB	60	10
D-CLING (Ours)	RGB-D	80	100

Depth Conditioning Enables Out-of-Distribution Generalization

The most dramatic improvement occurs in the out-of-distribution Multi Obstacle scenario (10% → 100%), where RGB-only fine-tuning fails due to delayed avoidance initiation. Depth conditioning strengthens geometric awareness and spatial reasoning, enabling earlier and more robust obstacle avoidance even in unseen obstacle configurations. This validates that our approach successfully transfers learned depth-awareness to novel scenarios.

Citation

If you find this work useful, please consider citing our paper.

@inproceedings{nakaoka2026dcling,
  title     = {D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning
               for Navigation Foundation Models},
  author    = {Shintaro Nakaoka and Takayuki Kanai and Kazuhito Tanaka},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}

Notification

The project page was solely developed for and published as part of the publication, titled “D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models” for its visualization. We do not ensure the future maintenance and monitoring of this page. Contents might be updated or deleted without notice regarding the original manuscript update and policy change.

This webpage template was adapted from DiffusionNOCS — we thank Takuya Ikeda for additional support and making their source available.

D-CLING: Prior-Preserving Depth-ConditionedFine-Tuning for Navigation Foundation Models

Method Overview Video