Project page | Our laboratory home page
by Yaoqi Sun, Zhaoyang Xu, Zhao Huang, Gaopeng Huang, Bin Wan, Haibing Yin, Jin Liu, Zhiwen Zheng, Xiaoshuai Zhang, and Xingru Huang
Hangzhou Dianzi University IMOP-lab
RONet is a novel encoder-decoder network specifically designed for RGB-T salient object detection. This model effectively leverages the complementary information from RGB and thermal images, achieving high detection accuracy even in low-light environments or scenes with complex backgrounds.
The key features of RONet include:
- Recurrent Optimization Module (RO): Iteratively refines RGB and thermal features through cross-modal guidance.
- Multi-Scale Semantic Module (MS): Utilizes a multi-branch structure to extract spatial and semantic information.
- Detail Enhancement Module (DE): Employs multi-branch dilated convolutions to optimize low-level features and capture spatial details.
Figure 1: Network Architecture. The overall architecture of RONet, comprising the encoder, Recurrent Optimization Module (RO), Multi-Scale Semantic Module (MS), Detail Enhancement Module (DE), and the final decoding process.
RONet adopts a typical encoder-decoder network structure, consisting of the following components (a minimal code skeleton follows the list):
- Encoder: Extracts multi-level features from RGB and thermal images using the Swin Transformer backbone.
- Recurrent Optimization Module (RO): Iteratively optimizes features through cross-modal guidance, improving the fusion quality of RGB and thermal data.
- Decoder:
  - Multi-Scale Semantic Module (MS): Extracts semantic and positional information from high-level features.
  - Detail Enhancement Module (DE): Enhances low-level features and captures spatial boundary details.
- Output: Generates high-quality saliency maps.
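To make the data flow concrete, here is a minimal, runnable PyTorch skeleton of this pipeline. It is a sketch only: the two-stage convolutional encoder and the 1×1 fusion convolutions are stand-ins for the Swin Transformer backbone and the RO/MS/DE modules, and every class and attribute name here is an assumption rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for the Swin Transformer backbone, producing two feature levels."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)   # low-level features
        self.stage2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)  # high-level features

    def forward(self, x):
        f1 = self.stage1(x)
        return f1, self.stage2(f1)

class RONetSkeleton(nn.Module):
    """Encode both modalities, fuse per level, decode to a saliency map."""
    def __init__(self):
        super().__init__()
        self.encoder = TinyEncoder()            # shared backbone for RGB and thermal
        self.fuse_low = nn.Conv2d(64, 32, 1)    # placeholder where RO fuses low-level features
        self.fuse_high = nn.Conv2d(128, 64, 1)  # placeholder where RO fuses high-level features
        self.head = nn.Conv2d(96, 1, 1)         # saliency prediction head

    def forward(self, rgb, thermal):
        r1, r2 = self.encoder(rgb)
        t1, t2 = self.encoder(thermal)
        low = self.fuse_low(torch.cat([r1, t1], dim=1))    # DE would refine this path
        high = self.fuse_high(torch.cat([r2, t2], dim=1))  # MS would refine this path
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(torch.cat([low, high], dim=1)))

# Example: paired RGB and thermal inputs yield a single-channel saliency map.
out = RONetSkeleton()(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1, 112, 112])
```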
Figure 2: Recurrent Optimization Module (RO). The detailed structure of the RO module, illustrating how RGB and thermal features are iteratively optimized using channel and spatial attention mechanisms for effective feature fusion.
The Recurrent Optimization Module refines RGB and thermal features through iterative cross-modal guidance, enhancing feature quality. The process, sketched in code after the list, includes:
- Refining RGB features using a Channel Attention (CA) block.
- Optimizing thermal features using the refined RGB features.
- Enhancing RGB features further using the optimized thermal features.
- Fusing the two modalities to generate high-quality multimodal features.
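A minimal PyTorch sketch of these four steps is given below. It assumes a squeeze-and-excitation style CA block and plain 3×3 convolutions as the cross-modal guidance operators; the class names, channel layout, and iteration count are illustrative assumptions, and the spatial attention shown in Figure 2 is omitted for brevity.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed form of the CA block)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight channels by their global importance

class RecurrentOptimization(nn.Module):
    """Hypothetical RO block: iterative cross-modal refinement of RGB/thermal features."""
    def __init__(self, channels, iterations=2):
        super().__init__()
        self.iterations = iterations
        self.ca_rgb = ChannelAttention(channels)
        self.guide_thermal = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.guide_rgb = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_th):
        for _ in range(self.iterations):
            f_rgb = self.ca_rgb(f_rgb)                                  # 1. refine RGB with CA
            f_th = self.guide_thermal(torch.cat([f_th, f_rgb], dim=1))  # 2. RGB-guided thermal
            f_rgb = self.guide_rgb(torch.cat([f_rgb, f_th], dim=1))     # 3. thermal-guided RGB
        return self.fuse(torch.cat([f_rgb, f_th], dim=1))               # 4. fuse both modalities
```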
Figure 3: Multi-Scale Semantic Module (MS). The structure of the MS module, incorporating three branches to extract spatial information, preserve semantic features, and enhance positional information.
The Multi-Scale Semantic Module is designed to extract semantic and positional information. It includes three branches:
- Upsampling + Convolution + Downsampling branch: Extracts spatial information.
- Multi-Layer Perceptron (MLP) branch: Preserves the original semantic information of features.
- Downsampling + Convolution + Upsampling branch: Enhances positional information.
The outputs of these branches are fused and further refined using attention mechanisms.
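A sketch of the three branches follows, reusing the hypothetical ChannelAttention block from the RO sketch above; the branch convolutions, the pointwise realization of the MLP, and the fusion layer are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSemantic(nn.Module):
    """Hypothetical MS block: three resolution branches fused and refined with attention."""
    def __init__(self, channels):
        super().__init__()
        self.conv_up = nn.Conv2d(channels, channels, 3, padding=1)    # branch 1 convolution
        self.mlp = nn.Sequential(                                     # branch 2: pointwise MLP
            nn.Conv2d(channels, channels, 1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 1),
        )
        self.conv_down = nn.Conv2d(channels, channels, 3, padding=1)  # branch 3 convolution
        self.fuse = nn.Conv2d(3 * channels, channels, 3, padding=1)
        self.attn = ChannelAttention(channels)  # as defined in the RO sketch above

    def forward(self, x):
        h, w = x.shape[-2:]
        # Branch 1: upsample -> conv -> downsample, emphasizing spatial information.
        up = F.interpolate(x, scale_factor=2.0, mode="bilinear", align_corners=False)
        b1 = F.interpolate(self.conv_up(up), size=(h, w), mode="bilinear", align_corners=False)
        # Branch 2: MLP preserves the original semantics at native resolution.
        b2 = self.mlp(x)
        # Branch 3: downsample -> conv -> upsample, widening the view for positional cues.
        down = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
        b3 = F.interpolate(self.conv_down(down), size=(h, w), mode="bilinear", align_corners=False)
        # Fuse the branch outputs and refine the result with channel attention.
        return self.attn(self.fuse(torch.cat([b1, b2, b3], dim=1)))
```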
Figure 4: Detail Enhancement Module (DE). The DE module design utilizes multi-branch convolutional layers with different dilation rates to progressively optimize low-level features and capture fine spatial details.
The Detail Enhancement Module focuses on refining the spatial and boundary details of the target. Its design, sketched in code after the list, includes:
- Multi-branch convolutions with different dilation rates (1, 2, 4, 6).
- Progressive feature refinement using Channel Attention (CA) for branch-wise optimization.
- Fusion of all branch outputs to generate the final enhanced features.
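The sketch below mirrors that design, again with the hypothetical ChannelAttention block; chaining each dilated branch onto the previous branch's output is one plausible reading of the progressive refinement described above.

```python
import torch
import torch.nn as nn

class DetailEnhancement(nn.Module):
    """Hypothetical DE block: chained dilated branches, each gated by channel attention."""
    def __init__(self, channels, dilations=(1, 2, 4, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations
        )
        # ChannelAttention as defined in the RO sketch above.
        self.gates = nn.ModuleList(ChannelAttention(channels) for _ in dilations)
        self.fuse = nn.Conv2d(len(dilations) * channels, channels, 3, padding=1)

    def forward(self, x):
        outs, prev = [], x
        for branch, gate in zip(self.branches, self.gates):
            # Each branch refines the previous branch's output: progressive optimization.
            prev = gate(branch(prev))
            outs.append(prev)
        # Fuse all branch outputs into the final enhanced feature map.
        return self.fuse(torch.cat(outs, dim=1))
```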
- Python 3.8+
- PyTorch 1.9+
- CUDA 11.1+
- Additional dependencies: torchvision, numpy, matplotlib, etc.
- Clone the repository:

  ```bash
  git clone https://github.com/IMOP-lab/RONet.git
  cd RONet
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Prepare datasets (e.g., VT5000, VT1000, VT821) and organize them as follows:

  ```
  datasets/
  ├── VT5000/
  │   ├── train/
  │   ├── test/
  ├── VT1000/
  ├── VT821/
  ```
Train the model on the VT5000 dataset:

```bash
python train.py --dataset VT5000 --batch_size 16 --epochs 50
```

Evaluate the model on the VT1000 dataset:

```bash
python test.py --dataset VT1000 --weights checkpoints/RONet.pth
```
Table 1: Quantitative comparison of different methods for RGB-T salient object detection. The table summarizes the performance of various methods in terms of parameters (#Param), computational complexity (FLOPs), frame rate (FPS), and evaluation metrics (MAE (M), Fβ, Sα, and Eξ) across three datasets (VT5000, VT1000, VT821). Bold values indicate the best performance for each metric.
Methods | #Param (M) ↓ | FLOPs (G) ↓ | FPS ↑ | VT5000 (M ↓) | VT5000 (Fβ ↑) | VT5000 (Sα ↑) | VT5000 (Eξ ↑) | VT1000 (M ↓) | VT1000 (Fβ ↑) | VT1000 (Sα ↑) | VT1000 (Eξ ↑) | VT821 (M ↓) | VT821 (Fβ ↑) | VT821 (Sα ↑) | VT821 (Eξ ↑) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MIDD | 52.4 | 216.7 | 42.2 | .046 | .788 | .856 | .893 | .029 | .870 | .907 | .935 | .045 | .803 | .871 | .897 |
MMNet | 64.1 | 42.5 | 38.5 | .044 | .777 | .863 | .888 | .027 | .859 | .914 | .931 | .040 | .792 | .874 | .893 |
CSRNet | 1.0 | 4.4 | 46.4 | .042 | .809 | .868 | .907 | .024 | .875 | .918 | .939 | .038 | .829 | .885 | .912 |
ECFFNet | - | - | - | .038 | .803 | .875 | .911 | .022 | .873 | .924 | .947 | .035 | .807 | .877 | .907 |
TAGFNet | 36.2 | 115.1 | 39.5 | .036 | .826 | .884 | .916 | .021 | .888 | .926 | .951 | .035 | .821 | .881 | .909 |
APNet | 30.4 | 46.6 | 50.1 | .035 | .816 | .876 | .917 | .022 | .880 | .922 | .950 | .034 | .814 | .868 | .911 |
CGFNet | 69.9 | 347.8 | 52.3 | .035 | .851 | .883 | .924 | .024 | .901 | .921 | .953 | .036 | .842 | .879 | .915 |
TNet | 87.0 | 39.7 | 49.6 | .033 | .845 | .895 | .929 | .021 | .887 | .929 | .952 | .030 | .840 | .899 | .924 |
MCFNet | 70.8 | 40.8 | 34.2 | .033 | .846 | .888 | .928 | .019 | .900 | .932 | .960 | .029 | .842 | .891 | .925 |
CCFENet | 28.7 | 17.1 | 56.0 | .031 | .858 | .896 | .937 | .018 | .904 | .934 | .962 | .027 | .856 | .900 | .933 |
CAVER | 93.8 | 31.6 | 28.5 | .029 | .854 | .900 | .940 | .017 | .904 | .938 | .965 | .027 | .853 | .898 | .934 |
SwinNet* | 198.8 | 124.7 | 30.9 | .026 | .864 | .912 | .945 | .018 | .894 | .938 | .957 | .030 | .846 | .904 | .928 |
WaveNet | 30.2 | 26.7 | 7.7 | .026 | .863 | .912 | .944 | .015 | .905 | .945 | .967 | .025 | .854 | .912 | .935 |
HRTransNet* | 58.9 | 17.3 | 14.9 | .025 | .869 | .913 | .948 | .017 | .898 | .938 | .956 | .026 | .851 | .906 | .933 |
UidefNet* | - | - | - | .025 | .874 | .914 | .948 | .015 | .911 | .941 | .969 | **.023** | .872 | **.917** | **.945** |
SPNet* | 110.0 | 67.8 | 14.1 | .024 | .885 | .915 | .952 | .015 | **.918** | .942 | **.972** | **.023** | **.874** | .914 | .943 |
Ours* | 87.1 | 90.8 | 9.2 | **.021** | **.886** | **.923** | **.955** | **.014** | .913 | **.946** | .967 | .025 | .856 | .914 | .931 |
Experimental results show that RONet outperforms existing methods on the VT5000 and VT1000 benchmarks and remains competitive on VT821, with particular gains in accuracy and robustness under low-light and complex-background conditions.
Figure 5: Visual comparison of RGB-T salient object detection results. Each row shows the input RGB image (a), thermal image (b), and saliency maps generated by various methods (c-t). The proposed method produces more accurate saliency maps under challenging conditions, effectively capturing target shapes and boundaries.
This project is licensed under the MIT License. For details, please refer to The MIT License – Open Source Initiative.
For inquiries or collaborations, please contact:
- Yaoqi Sun ([email protected])
- Gaopeng Huang ([email protected])