
PVT-Net: A Point Cloud Scene Semantic Segmentation Network via Multimodal Feature Contrastive Learning

Abstract: To address irregular data distribution, limited annotation, and weak model generalization in large-scale point cloud scene semantic segmentation, this paper proposes PVT-Net, an indoor point cloud scene semantic segmentation network based on point cloud-view-text multimodal feature contrastive learning. Leveraging a point cloud feature encoder together with image and text encoders, the method achieves efficient cross-modal feature alignment. First, multi-view 2D projections are generated from the 3D scene point cloud via the projection matrices between the 2D views and the 3D scene, and an image captioning model produces textual captions of the scene from these views. The unordered voxelized point cloud is then serialized into a one-dimensional structure using a Hilbert space-filling curve, which preserves 3D spatial relationships while enabling efficient Mamba-based sequence modeling. A Mamba-enhanced 3D U-Net extracts scene features: local geometric details are captured by the U-Net architecture, global contextual information is modeled through hierarchical downsampling with Mamba modules, and the decoder restores spatial detail through upsampling and skip connections to fuse local and global features. Meanwhile, the image and text encoders of a pre-trained CLIP model extract view features and text embeddings; feature adapters align point cloud features with the image and text embeddings in a shared space, and category-level text embeddings are loaded into the segmentation head as classifier weights to guide semantic prediction. During training, a binary classification loss is added to the semantic segmentation loss to balance recognition of base and novel categories, enabling open-vocabulary semantic segmentation of indoor scenes. Experiments on the S3DIS dataset show that PVT-Net achieves an hIoU of 45.1%, outperforming RegionPLC and OpenScene by 4.5 and 3.9 percentage points, respectively. By combining the global linear-complexity modeling of Mamba with the local feature extraction strength of U-Net, the proposed method demonstrates strong robustness and effectiveness.
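The serialization step can be illustrated with a space-filling curve over integer voxel coordinates. As a minimal stand-in for the Hilbert curve used in the paper, the sketch below uses Morton (Z-order) encoding, a simpler curve that likewise maps 3D voxel coordinates to a 1D ordering in which spatially nearby voxels tend to stay close in the sequence; all names and the toy voxel set are illustrative.

```python
def part1by2(x: int) -> int:
    """Spread the low 10 bits of x, inserting two zero bits between each bit."""
    x &= 0x3FF
    x = (x | (x << 16)) & 0x030000FF
    x = (x | (x << 8)) & 0x0300F00F
    x = (x | (x << 4)) & 0x030C30C3
    x = (x | (x << 2)) & 0x09249249
    return x

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave the bits of (x, y, z) into a single Z-order curve index."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Serialize: sort voxel coordinates by their curve index, turning the
# unordered voxel set into a 1D sequence suitable for sequence models.
voxels = [(3, 1, 0), (0, 0, 0), (1, 1, 1), (2, 0, 3)]
serialized = sorted(voxels, key=lambda v: morton3d(*v))
print(serialized)  # (0, 0, 0) comes first: its curve index is 0
```

A Hilbert curve would replace `morton3d` with a Hilbert index; Hilbert ordering avoids the long jumps Z-order makes between octants, which is why the paper prefers it.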
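The step that loads category text embeddings into the segmentation head can be read as a cosine-similarity classifier: point features and CLIP text embeddings are L2-normalized, and the text-embedding matrix acts as the weight of the final linear layer. Below is a minimal NumPy sketch under that reading; the shapes, temperature value, and function name are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def text_driven_logits(point_feats, text_embeds, tau=0.07):
    """Score each point against each category by cosine similarity.

    point_feats: (N, D) per-point features from the 3D backbone
    text_embeds: (C, D) text embeddings, one per category name
    Returns (N, C) logits; argmax over C gives the predicted class.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return (p @ t.T) / tau

# Toy check: a point whose feature equals a class embedding picks that class.
rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(4, 8))   # 4 categories, 8-dim embeddings
point_feats = np.stack([text_embeds[2], rng.normal(size=8)])
pred = text_driven_logits(point_feats, text_embeds).argmax(axis=1)
print(pred[0])  # → 2
```

Because the classifier weights are just text embeddings, new category names can be scored at inference time without retraining, which is what makes the segmentation open-vocabulary.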
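The training objective, a semantic segmentation loss plus a binary base-vs-novel loss, can be sketched as a weighted sum of two cross-entropies. The weight `lam` and the use of plain softmax cross-entropy for both terms are assumptions for illustration, not the paper's exact losses.

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean softmax cross-entropy, numerically stabilized via max subtraction."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def total_loss(seg_logits, seg_labels, bin_logits, bin_labels, lam=0.5):
    """Semantic segmentation CE plus a weighted base/novel binary CE."""
    return softmax_xent(seg_logits, seg_labels) + lam * softmax_xent(bin_logits, bin_labels)

# Toy check: confident correct predictions drive both terms toward zero.
seg_logits = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
seg_labels = np.array([0, 1])
bin_logits = np.array([[10.0, 0.0], [0.0, 10.0]])  # base vs. novel
bin_labels = np.array([0, 1])
loss = total_loss(seg_logits, seg_labels, bin_logits, bin_labels)
print(loss)
```

The binary term keeps the network from collapsing all predictions onto the base categories it saw labels for, balancing recognition of base and novel classes.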

     
