Figure 1: Detailed network structure of the proposed VTF-Net.
Figure 2: Visual representation of the VTFE structure.
The hardware configuration consisted of a desktop system equipped with two NVIDIA RTX 3080 GPUs, an Intel Xeon E5-2690 v4 CPU, and 256 GB of RAM. The software environment comprised Python 3.9, PyTorch 2.0.0, and CUDA 11.8, and training was parallelized across the two GPUs using PyTorch's DistributedDataParallel (DDP) implementation.
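For reference, the following is a minimal sketch of a two-GPU DDP training loop consistent with the setup described above. The model, dataset, optimizer, batch size, and NCCL backend choice are placeholders and assumptions, not the actual VTF-Net training configuration.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_ddp(model, dataset, epochs=1, batch_size=8):
    # Launched via: torchrun --nproc_per_node=2 train.py
    dist.init_process_group(backend="nccl")         # NCCL: the standard backend for GPUs
    local_rank = int(os.environ["LOCAL_RANK"])      # set by torchrun for each process
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)           # shards the dataset across the two GPUs
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder hyperparameters
    for epoch in range(epochs):
        sampler.set_epoch(epoch)                    # reshuffle the shards each epoch
        for images, masks in loader:
            images, masks = images.cuda(local_rank), masks.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(images), masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()
```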
| Dataset | Quantity | Training Set | Validation Set | Testing Set |
|---|---|---|---|---|
| CMED-18k | 10000 | 7200 | 800 | 2000 |
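A minimal sketch of how such a 7200/800/2000 split could be reproduced is shown below; the use of `random_split` and the fixed seed are assumptions, as the exact splitting procedure is not specified here.

```python
import torch
from torch.utils.data import Dataset, random_split

def split_cmed(dataset: Dataset, seed: int = 42):
    """Split the 10000 CMED-18k samples into 7200/800/2000 subsets."""
    generator = torch.Generator().manual_seed(seed)  # assumed seed for reproducibility
    return random_split(dataset, lengths=[7200, 800, 2000], generator=generator)
```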
We provide GitHub links to the PyTorch implementations of all networks compared in this experiment here, so that all baselines can be easily reproduced.
UNet; FCN8s; SegNet; PSPNet; ENet; ICNet; UNet+AttGate; DANet; LEDNet; DUNet; CENet; CGNet; OCNet; GCN.
Table 1: Segmentation performance of the proposed method compared with 14 baseline models on the CMED-18k dataset. Metrics include the Dice coefficient, HD, HD95, NCC, and the Kappa statistic. The highest value for each metric is highlighted in red, and the second highest in blue.
Figure 4: Qualitative comparison between VTF-Net and the 14 baseline models. The first row presents the original input images, followed by the corresponding segmentation results, including zoomed-in views of edema regions to highlight segmentation detail.
All experiments were executed under identical conditions, and the results are detailed in Table 1 and Figure 4. VTF-Net achieved competitive results across all evaluation metrics.
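For clarity, a minimal sketch of two of the reported metrics (the Dice coefficient and the Hausdorff distance) computed on binary masks is given below; the function names and the NumPy/SciPy formulation are illustrative, not the evaluation code used in this work.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    # Dice = 2|P ∩ T| / (|P| + |T|) on binary masks.
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def hausdorff_distance(pred: np.ndarray, target: np.ndarray) -> float:
    # Symmetric Hausdorff distance between the two foreground pixel sets.
    p_pts, t_pts = np.argwhere(pred), np.argwhere(target)
    return max(directed_hausdorff(p_pts, t_pts)[0],
               directed_hausdorff(t_pts, p_pts)[0])
```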
Table 2: Ablation study results for the VTF-Net architecture, comparing the impact of the individual modules (VTFE, MSAF, AFRP, and EFRE) on segmentation performance across multiple metrics, including the Dice coefficient, HD, HD95, NCC, and Kappa. The highest values are highlighted in red and the second highest in blue, demonstrating the relative contribution of each module to the overall network efficacy.
Table 3: Ablation study results for the attention fusion strategies within the MSAF module, illustrating their differential impact on segmentation performance across multiple quantitative metrics. CA and EMA denote Coordinate Attention and Efficient Multi-Scale Attention, respectively, while CB denotes the Convolutional Block Attention Module. FFT refers to the Fast Fourier Transform, LSK indicates a Large Selective Kernel network, and CA EMA signifies the serial concatenation of the CA and EMA outputs. The configuration labeled FFT + CA EMA denotes a parallel fusion of the FFT and CA EMA outputs, and FSA represents a frequency split attention strategy. The highest values are highlighted in red and the second highest in blue.
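To make the fusion configurations concrete, below is a minimal sketch of the FFT + CA EMA variant: a frequency branch built on `torch.fft` fused in parallel with a serial CA-then-EMA attention branch. The module structure, channel sizes, and the simplified attention blocks are assumptions for illustration; in particular, an SE-style channel gate stands in for the real EMA block, and this is not the actual MSAF implementation.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Simplified Coordinate Attention (CA): directional pooling along H and W."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 4)
        self.conv1 = nn.Conv2d(channels, hidden, 1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(hidden, channels, 1)
        self.conv_w = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        pool_h = x.mean(dim=3, keepdim=True)                   # (B, C, H, 1)
        pool_w = x.mean(dim=2, keepdim=True).transpose(2, 3)   # (B, C, W, 1)
        y = self.act(self.conv1(torch.cat([pool_h, pool_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))  # (B, C, 1, W)
        return x * a_h * a_w

class ChannelGate(nn.Module):
    """SE-style channel gate used here as a simplified stand-in for EMA."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class FFTCAEMAFusion(nn.Module):
    """Sketch of the 'FFT + CA EMA' variant: a frequency branch fused in
    parallel with a serial CA -> EMA attention branch."""
    def __init__(self, channels):
        super().__init__()
        self.ca = CoordAttention(channels)
        self.ema = ChannelGate(channels)          # simplified EMA stand-in
        self.freq_conv = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        # Frequency branch: learnable filtering of the magnitude spectrum.
        freq = torch.fft.rfft2(x, norm="ortho")
        mag = self.freq_conv(freq.abs())
        freq_feat = torch.fft.irfft2(torch.polar(mag, freq.angle()),
                                     s=x.shape[-2:], norm="ortho")
        # Serial attention branch: CA followed by the EMA stand-in.
        attn_feat = self.ema(self.ca(x))
        # Parallel fusion by channel concatenation and a 1x1 convolution.
        return self.fuse(torch.cat([freq_feat, attn_feat], dim=1))
```

An instance can be smoke-tested with `FFTCAEMAFusion(64)(torch.randn(1, 64, 32, 32))`, which returns a tensor of the same shape as the input.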
Table 4: Detailed ablation results for the VTFE module, evaluating the impact of variations in architectural parameters (number of layers, kernel sizes, and convolutional blocks) on the segmentation metrics. Multiple configurations were tested to determine the optimal combination of these parameters, with the highest metric values marked in red and the second highest in blue. The configuration employing 4 layers, 5×5 kernels, and 2 convolutional blocks achieved the most favorable performance, indicating the importance of deeper feature hierarchies and larger receptive fields for capturing complex patterns in retinal OCT segmentation.
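As an illustration of the parameters swept in Table 4, here is a minimal, hypothetical sketch of a configurable convolutional stack exposing the number of layers, kernel size, and blocks per layer; it is not the actual VTFE implementation, and the channel width is an assumed placeholder.

```python
import torch.nn as nn

def conv_block(channels, kernel_size):
    # One convolutional block: conv -> batch norm -> ReLU; padding preserves spatial size.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

def build_vtfe_stack(channels=64, num_layers=4, kernel_size=5, blocks_per_layer=2):
    """Defaults mirror the best Table 4 configuration: 4 layers, 5x5 kernels, 2 blocks."""
    layers = []
    for _ in range(num_layers):
        for _ in range(blocks_per_layer):
            layers.append(conv_block(channels, kernel_size))
    return nn.Sequential(*layers)
```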
If you have any questions, please contact [email protected].