Multimodal Deep Learning-based Feature Fusion for Object Detection in Remote Sensing Images
Computer Science and Information Systems, Volume 22 (2025) no. 1.

Object detection is an important computer vision task that grew out of image classification. Unlike classification, it does not merely assign a single category to an image. It must classify and localize every object that may be present in the image at the same time. Classification assigns a category label to each object, and localization determines the corner coordinates of the object's bounding box. Object detection is therefore more challenging and has broader applications, such as autonomous driving, face recognition, pedestrian detection, and medical detection. It also serves as the foundation for more complex computer vision tasks such as image segmentation, image captioning, object tracking, and action recognition. Traditional object detection methods make poor use of image features and are easily affected by environmental factors. Hence, this paper proposes a multimodal deep learning-based feature fusion method for object detection in remote sensing images. In the new model, Cascade R-CNN serves as the backbone network, and a parallel Cascade R-CNN network is used for feature fusion to enhance feature representation. To handle segmentation regions of different shapes and sizes, the central part of the network adopts multi-coefficient cascaded hollow (dilated) convolution, which obtains features with multiple receptive fields without pooling and thus preserves image information. Meanwhile, an improved self-attention mechanism combined with a receptive field strategy is used to obtain both low-level features with edge details and high-level features with global semantics. Finally, we conduct ablation and comparison experiments on the DOTA dataset. The experimental results show that the mean Average Precision (mAP) and other metrics improve substantially and that the method outperforms state-of-the-art detection algorithms. It therefore has good application prospects for object detection in remote sensing images.
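The abstract states that cascaded hollow convolution with several coefficients yields multi-receptive-field features without pooling. The Python sketch below illustrates that idea under stated assumptions: "hollow convolution" is read as dilated (atrous) convolution, each layer uses stride 1 with zero padding so the feature map keeps its size, and the "coefficients" are read as dilation rates. The dilation rates (1, 2, 5), the 3x3 all-ones kernel, and all function names are hypothetical choices for illustration, not values or code from the paper.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field (pixels per side) of stacked stride-1 dilated convolutions.

    A layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def dilated_conv2d_same(image: Matrix, kernel: Matrix, dilation: int) -> Matrix:
    """Single-channel 2-D dilated convolution (cross-correlation), stride 1.

    Zero padding of dilation * (k - 1) // 2 keeps the output the same height
    and width as the input, so no resolution is lost (no pooling).
    Assumes a square kernel with an odd side length.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    y = i - pad + u * dilation
                    x = j - pad + v * dilation
                    if 0 <= y < h and 0 <= x < w:  # outside = zero padding
                        acc += kernel[u][v] * image[y][x]
            out[i][j] = acc
    return out


def cascaded_dilated_features(
    image: Matrix, kernel: Matrix, dilations: Sequence[int]
) -> List[Matrix]:
    """Apply dilated convolutions in cascade and keep every stage's output.

    Stage n sees a receptive field of receptive_field(k, dilations[:n + 1]),
    giving a set of same-sized feature maps with growing context.
    """
    features: List[Matrix] = []
    x = image
    for d in dilations:
        x = dilated_conv2d_same(x, kernel, d)
        features.append(x)
    return features


if __name__ == "__main__":
    rates = (1, 2, 5)  # hypothetical dilation coefficients, not from the paper
    # Impulse image: one bright pixel shows how far each stage spreads it.
    img = [[0.0] * 31 for _ in range(31)]
    img[15][15] = 1.0
    ones = [[1.0] * 3 for _ in range(3)]  # hypothetical 3x3 kernel
    for n, f in enumerate(cascaded_dilated_features(img, ones, rates), start=1):
        width = sum(1 for v in f[15] if v != 0.0)
        print(f"stage {n}: size {len(f)}x{len(f[0])}, spread {width} px, "
              f"receptive field {receptive_field(3, rates[:n])}")
```

With rates (1, 2, 5) and a 3x3 kernel, the receptive field grows from 3 to 7 to 1 + 2(1 + 2 + 5) = 17 pixels per side, while every feature map stays the size of the input. That is the property the abstract attributes to avoiding pooling. A practical model would use a deep learning framework's built-in dilated convolution layers over multi-channel tensors rather than these explicit loops.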
Keywords: Object detection, remote sensing image, multimodal deep learning, feature fusion
@article{CSIS_2025_22_1_a15,
     author = {Shoulin Yin and Qunming Wang and Liguo Wang and Mirjana Ivanovi\'c and Hang Li},
     title = {Multimodal {Deep} {Learning-based} {Feature} {Fusion} for {Object} {Detection} in {Remote} {Sensing} {Images}},
     journal = {Computer Science and Information Systems},
     publisher = {mathdoc},
     volume = {22},
     number = {1},
     year = {2025},
     url = {http://geodesic.mathdoc.fr/item/CSIS_2025_22_1_a15/}
}
TY  - JOUR
AU  - Shoulin Yin
AU  - Qunming Wang
AU  - Liguo Wang
AU  - Mirjana Ivanović
AU  - Hang Li
TI  - Multimodal Deep Learning-based Feature Fusion for Object Detection in Remote Sensing Images
JO  - Computer Science and Information Systems
PY  - 2025
VL  - 22
IS  - 1
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/CSIS_2025_22_1_a15/
ID  - CSIS_2025_22_1_a15
ER  - 
%0 Journal Article
%A Shoulin Yin
%A Qunming Wang
%A Liguo Wang
%A Mirjana Ivanović
%A Hang Li
%T Multimodal Deep Learning-based Feature Fusion for Object Detection in Remote Sensing Images
%J Computer Science and Information Systems
%D 2025
%V 22
%N 1
%I mathdoc
%U http://geodesic.mathdoc.fr/item/CSIS_2025_22_1_a15/
%F CSIS_2025_22_1_a15
Shoulin Yin; Qunming Wang; Liguo Wang; Mirjana Ivanović; Hang Li. Multimodal Deep Learning-based Feature Fusion for Object Detection in Remote Sensing Images. Computer Science and Information Systems, Volume 22 (2025) no. 1. http://geodesic.mathdoc.fr/item/CSIS_2025_22_1_a15/