MIAO Yongwei, ZHU Guangshuai, LIU Fuchang, LIU Haijian, YAN Caiping, ZHANG. PVT-Net: a point cloud semantic segmentation network via multimodal feature contrastive learning[J]. Journal of Computer-Aided Design & Computer Graphics. DOI: 10.3724/SP.J.1089.2025-00428

PVT-Net: A Point Cloud Semantic Segmentation Network via Multimodal Feature Contrastive Learning

Abstract: To address the challenges of irregular data distribution, limited annotations, and poor generalization in point cloud semantic segmentation, this paper proposes PVT-Net, a multimodal point cloud semantic segmentation network based on point cloud-view-text contrastive learning. By leveraging point cloud, image, and text encoders, the proposed method achieves effective cross-modal feature alignment. Specifically, multi-view 2D projections are generated from 3D point clouds via projection matrices, and an image captioning model produces the corresponding textual descriptions. The unordered voxelized point cloud is then serialized into a one-dimensional structure using a Hilbert space-filling curve, preserving spatial relationships while enabling efficient Mamba-based sequence modeling. A Mamba-enhanced 3D U-Net extracts features: local geometric details are captured by the U-Net architecture, while global contextual information is modeled via hierarchical downsampling with Mamba modules. The decoder reconstructs spatial details through upsampling and skip connections that fuse local and global features. Meanwhile, a pre-trained CLIP model extracts visual and textual features. Feature adapters align point cloud features with image and text embeddings in a shared space, and category-level textual embeddings are incorporated into the segmentation head to guide semantic prediction. During training, a binary classification loss is added to the segmentation loss to balance the recognition of base and novel categories, enabling open-vocabulary semantic segmentation. Experiments on the S3DIS dataset show that PVT-Net achieves an hIoU of 45.1%, outperforming RegionPLC and OpenScene by 4.5% and 3.9%, respectively. By combining the linear-complexity global modeling capability of Mamba with the local feature extraction strength of U-Net, the proposed method demonstrates strong robustness and effectiveness.
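The cross-modal alignment step described above can be sketched as follows: a small adapter network projects backbone point features into the shared CLIP embedding space, and an InfoNCE-style contrastive loss pulls each point feature toward its paired image/text embedding. This is a minimal illustrative sketch, not the paper's implementation; all class names, layer sizes, and the specific loss form are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class FeatureAdapter:
    """Hypothetical two-layer MLP mapping point features (d_in) into a
    shared CLIP-style embedding space (d_out)."""
    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (d_in, d_out))
        self.w2 = rng.normal(0.0, 0.02, (d_out, d_out))

    def __call__(self, x):
        h = np.maximum(x @ self.w1, 0.0)   # ReLU hidden layer
        return l2_normalize(h @ self.w2)   # unit-norm embeddings

def info_nce_loss(point_emb, clip_emb, temperature=0.07):
    """Symmetric InfoNCE loss: each point embedding should be most similar
    to its own paired image/text embedding (the diagonal of the logits)."""
    logits = (point_emb @ clip_emb.T) / temperature  # (N, N) similarity matrix
    targets = np.arange(len(logits))                 # positives on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(l)), targets].mean()

    return 0.5 * (xent(logits) + xent(logits.T))     # point->clip and clip->point

# Toy usage: 8 points with 32-dim backbone features, 16-dim shared space.
adapter = FeatureAdapter(d_in=32, d_out=16)
points = np.random.default_rng(1).normal(size=(8, 32))
point_emb = adapter(points)
clip_emb = l2_normalize(np.random.default_rng(2).normal(size=(8, 16)))
loss = info_nce_loss(point_emb, clip_emb)
```

Minimizing this loss drives paired point and CLIP embeddings together while pushing apart mismatched pairs, which is the mechanism that lets category-level text embeddings later act as classifiers in the segmentation head.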
