Abstract: RGB-D semantic segmentation has been extensively studied and has achieved remarkable results. However, existing methods fuse RGB and depth features with simple strategies that fail to fully exploit multimodal information. In addition, most current methods rely on dual-stream Transformers for feature extraction, which substantially increases the parameter count and hinders practical deployment. To address these problems, this paper designs an RGB-D semantic segmentation network that combines a Transformer and a CNN, using a Mix Transformer to extract depth features and a ConvNeXt backbone to extract RGB features. To exploit both modalities effectively, a feature interaction complementation module is designed to enable mutual interaction and correction between RGB and depth features during feature extraction. In addition, an asymmetric feature selection fusion module is proposed to fuse RGB and depth features seamlessly and effectively. Extensive experiments on the NYU Depth V2 and SUN RGB-D datasets show that the proposed method achieves fast and effective segmentation of complex indoor scenes.
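
To make the dual-branch design concrete, the following is a minimal conceptual sketch, not the authors' released code, of the layout the abstract describes: a Mix Transformer branch for depth, a ConvNeXt branch for RGB, a feature interaction and correction step between stages, and an asymmetric selection step at fusion time. All module names, channel sizes, and attention choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FeatureInteraction(nn.Module):
    """Cross-modal correction sketch: each branch is recalibrated by a gate
    computed from the other branch (one plausible realization, assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.rgb_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb, depth):
        # RGB features are corrected by a gate derived from depth, and vice versa.
        rgb_out = rgb + rgb * self.depth_gate(depth)
        depth_out = depth + depth * self.rgb_gate(rgb)
        return rgb_out, depth_out


class AsymmetricSelectionFusion(nn.Module):
    """Asymmetric fusion sketch: RGB is treated as the primary stream; a spatial
    mask predicted from RGB selects which depth responses to admit (assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.select = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb, depth):
        mask = self.select(rgb)                       # where RGB trusts depth
        fused = torch.cat([rgb, mask * depth], dim=1)
        return self.merge(fused)


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 60, 80)    # stage features from the ConvNeXt (RGB) branch
    depth = torch.randn(2, 64, 60, 80)  # stage features from the Mix Transformer (depth) branch
    rgb, depth = FeatureInteraction(64)(rgb, depth)
    out = AsymmetricSelectionFusion(64)(rgb, depth)
    print(out.shape)  # torch.Size([2, 64, 60, 80])
```

The asymmetry in this sketch reflects the abstract's framing: rather than merging the two streams symmetrically, the depth features are filtered by an RGB-driven selection before fusion; the paper's actual module internals may differ.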