Under a uniform Mask2Former frontend, we benchmark 3DIML against Panoptic Lifting and Contrastive Lift on the Replica-vMap and ScanNet datasets. While cutting neural field training iterations by 25×, 3DIML achieves comparable accuracy as measured by Scene Level Panoptic Quality and mIoU. Whereas Panoptic Lifting takes 5.7 hours on average and Contrastive Lift about 3.5 hours, 3DIML finishes a scan in under 20 minutes on a single RTX 3090.

Make Class-Agnostic 3D Segmentation Efficient with 3DIML

Abstract and I. Introduction

II. Background

III. Method

IV. Experiments

V. Conclusion and References


IV. EXPERIMENTS

We benchmark our method against Panoptic Lifting and Contrastive Lift using the same Mask2Former frontend. For fairness, we render semantics as in [5], [6], using the same multiresolution hashgrid for both semantics and instances. For the other experiments, we use GroundedSAM as our frontend and FastSAM for runtime-critical tasks such as label merging and InstanceLoc.
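To make the shared-field setup concrete, below is a minimal PyTorch sketch of how semantic and instance logits can be decoded from one shared feature encoding and accumulated along each ray with volume-rendering weights, in the spirit of [5], [6]. The class name, its interface, and the toy stand-in encoder are assumptions for illustration; the actual implementations build on Nerfstudio [18] and a multiresolution hashgrid [19].

```python
import torch
import torch.nn as nn

class LabelFieldHead(nn.Module):
    """Minimal sketch (not the authors' implementation): decode semantic and
    instance logits from a shared feature encoding and render them per ray."""

    def __init__(self, encoding, feat_dim, num_classes, num_instances, hidden=64):
        super().__init__()
        self.encoding = encoding  # maps (..., 3) points -> (..., feat_dim) features
        self.semantic_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
        self.instance_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_instances))

    def forward(self, points, weights):
        """points: (num_rays, num_samples, 3) sample locations along each ray.
        weights: (num_rays, num_samples) volume-rendering weights from the
        density field. Returns per-ray semantic and instance logits."""
        feats = self.encoding(points)                       # (R, S, feat_dim)
        sem = self.semantic_head(feats)                     # (R, S, num_classes)
        inst = self.instance_head(feats)                    # (R, S, num_instances)
        # Accumulate per-sample logits along each ray using the rendering weights.
        sem_ray = (weights.unsqueeze(-1) * sem).sum(dim=1)
        inst_ray = (weights.unsqueeze(-1) * inst).sum(dim=1)
        return sem_ray, inst_ray

# Toy usage with a stand-in encoder (a real pipeline would use a hashgrid [19]):
encoder = nn.Sequential(nn.Linear(3, 32), nn.ReLU())
head = LabelFieldHead(encoder, feat_dim=32, num_classes=21, num_instances=50)
points = torch.rand(128, 48, 3)                     # 128 rays, 48 samples per ray
weights = torch.softmax(torch.rand(128, 48), dim=-1)
sem_logits, inst_logits = head(points, weights)     # shapes (128, 21) and (128, 50)
```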

A. Datasets

We evaluate our methods on a challenging subset of scans from Replica and ScanNet, both of which provide ground-truth annotations. Since our methods are based on structure-from-motion techniques, we use the Replica-vMap [20] sequences, which are more indicative of real-world image sequences. For Replica and ScanNet, we avoid scans that are incompatible with our method (multi-room layouts, low visibility, scans on which Nerfacto does not converge), as well as scans containing many close-up views of identical objects that easily confuse NetVLAD and LoFTR.

B. Metrics

For lifting panoptic segmentation, we use Scene Level Panoptic Quality [5], defined as the Panoptic Quality computed over the concatenated sequence of images. For GroundedSAM, especially for instance masks of smaller objects, the predicted masks diverge from the ground-truth annotations. We therefore report the mIoU over predicted-reference mask pairs with IoU > 0.5 across all frames (the true positives in Scene Level Panoptic Quality), as well as the number of such matched masks and the total number of reference masks.
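To make the evaluation protocol concrete, the sketch below computes a class-agnostic version of these quantities from boolean masks over the concatenated image sequence: a scene-level panoptic-quality-style score and the mIoU over matched (IoU > 0.5) predicted-reference pairs. This is an illustrative simplification under assumed inputs; the metric in [5] additionally handles semantic classes and void regions.

```python
import numpy as np

def match_masks(pred_masks, gt_masks, iou_thresh=0.5):
    """Match predicted and reference instance masks by IoU. Masks are boolean
    arrays over the concatenated image sequence. Returns (pred_idx, gt_idx, iou)
    triples with IoU > iou_thresh; at > 0.5 the matching is one-to-one."""
    ious = []
    for i, p in enumerate(pred_masks):
        for j, g in enumerate(gt_masks):
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            if union > 0:
                ious.append((inter / union, i, j))
    matches, used_pred, used_gt = [], set(), set()
    for iou, i, j in sorted(ious, reverse=True):       # greedy, highest IoU first
        if iou <= iou_thresh or i in used_pred or j in used_gt:
            continue
        matches.append((i, j, iou))
        used_pred.add(i)
        used_gt.add(j)
    return matches

def scene_level_pq(pred_masks, gt_masks):
    """Panoptic-quality-style score over the whole sequence at once:
    sum(IoU of TP) / (TP + 0.5 * FP + 0.5 * FN), class-agnostic."""
    matches = match_masks(pred_masks, gt_masks)
    tp = len(matches)
    fp = len(pred_masks) - tp
    fn = len(gt_masks) - tp
    iou_sum = sum(iou for _, _, iou in matches)
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom > 0 else 0.0

def matched_miou(pred_masks, gt_masks):
    """mIoU over matched (IoU > 0.5) pairs, plus the number of matched masks."""
    matches = match_masks(pred_masks, gt_masks)
    mious = [iou for _, _, iou in matches]
    return (float(np.mean(mious)) if mious else 0.0), len(matches)
```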

C. Implementation Details


D. Results

Comparison with Panoptic and Contrastive Lifting: Table I shows the Scene Level Panoptic Quality for 3DIML and the other methods on Replica-vMap sequences subsampled by 10 (200 frames). We observe that 3DIML approaches Panoptic Lifting in performance while achieving a much better practical runtime (considering implementation) than Panoptic and Contrastive Lifting. Intuitively, this is achieved by relying on implicit scene representation methods only at critical junctions, i.e., after InstanceMap, greatly reducing the number of neural field training iterations (25× fewer). Figure 4 compares the instances identified by Panoptic Lifting to those identified by 3DIML.

We benchmark all runtimes on a single RTX 3090, measured after mask generation. Specifically, comparing their implementations to ours, Panoptic Lifting requires 5.7 hours of training averaged over all scans, with a minimum of 3.6 and a maximum of 6.6 hours, since its runtime depends on the number of objects. Contrastive Lift takes around 3.5 hours on average, while 3DIML runs in under 20 minutes (14.5 minutes on average) for all scans. Note that several components of 3DIML can be easily parallelized, such as dense descriptor extraction using LoFTR and label merging. The runtime of our method depends on the number of correspondences produced by LoFTR, which does not change with the frontend segmentation model, and we observe similar runtimes in our other experiments.

Fig. 4: Comparison between Panoptic Lifting and 3DIML for room0 from Replica-vMap.

TABLE I: Quantitative comparison between Panoptic Lifting [5], Contrastive Lift [6], and our framework components, InstanceMap and InstanceLift. We measure the Scene Level Panoptic Quality metric (higher is better). Our approach offers competitive performance while being far more efficient to train. The best-performing numbers for each scene are in bold, and the second-best numbers are shaded yellow.

TABLE II: Runtime in minutes of Panoptic Lifting, Contrastive Lift, and 3DIML, benchmarked on a single RTX 3090.

Fig. 5: InstanceLift is able to fill in labels missed by InstanceMap as well as correct ambiguities. Here we show comparisons between them for office0 and room0 from Replica-vMap.

GroundedSAM: Table III shows our results for lifting GroundedSAM masks on Replica-vMap. From Figure 5, we see that InstanceLift is effective at interpolating labels missed by InstanceMap and at resolving ambiguities produced by GroundedSAM[1]. Figure 7 shows that InstanceMap and 3DIML are robust to large viewpoint changes as well as duplicate objects, assuming well-behaved scans, that is, enough context for NetVLAD and LoFTR to distinguish between them. Table IV and Fig. 6 illustrate our performance on ScanNet [21].

Fig. 6: Some results for scans 0144_01, 0050_02, and 0300_01 from ScanNet [21] (one scene per row, top to bottom), showcasing how 3DIML accurately and consistently delineates instances in 3D.

Novel View Rendering and InstanceLoc: Table V shows the performance of 3DIML on the second track provided in Replica-vMap. We observe that InstanceLift is effective at rendering novel views, and InstanceLoc therefore performs well. For Replica-vMap with FastSAM, InstanceLoc takes 0.16 s per localized image on average (6.2 frames per second). In addition, InstanceLoc can be applied as a post-processing step to the renders of the input sequence, acting as a denoising operation.
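As a rough illustration of how a fast segmenter and the refined label field can be combined for localization and denoising, the sketch below assigns each fast-segmentation mask the majority instance label rendered by InstanceLift inside that mask. The function name and the majority-vote rule are assumptions for illustration based on the description above, not the paper's exact procedure.

```python
import numpy as np

def assign_labels_to_masks(fast_masks, rendered_labels, background=0):
    """Hypothetical sketch of mask-to-label assignment for localization.
    fast_masks: list of boolean (H, W) masks from a fast segmenter (e.g. FastSAM).
    rendered_labels: (H, W) instance-label map rendered by the neural label field.
    Returns an (H, W) label map where each mask takes the majority rendered label."""
    out = np.full(rendered_labels.shape, background, dtype=rendered_labels.dtype)
    for mask in fast_masks:
        labels_in_mask = rendered_labels[mask]
        labels_in_mask = labels_in_mask[labels_in_mask != background]
        if labels_in_mask.size == 0:
            continue  # no rendered evidence inside this mask; leave as background
        values, counts = np.unique(labels_in_mask, return_counts=True)
        out[mask] = values[np.argmax(counts)]  # majority vote inside the mask
    return out
```

Under this reading, the denoising effect noted above comes from snapping label boundaries to sharp image-space masks while the instance identities still come from the 3D-consistent field.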

E. Limitations and Future Work

Under extreme viewpoint changes, our method sometimes produces discontinuous 3D instance labels. For example, on the worst-performing scene, office2, the scan captures only the front of a chair when facing the back of the room and only the back of a chair when facing the front of the room for many frames. InstanceMap is therefore unable to conclude that these labels refer to the same object, and InstanceLift cannot fix it, since NeRF's correction ability degrades rapidly with increasing label inconsistency [12]. However, very few such errors remain per scene after running 3DIML, and they can easily be fixed via sparse human annotation.

V. CONCLUSION

In this paper, we present 3DIML, which addresses the problem of 3D instance segmentation in a class-agnostic and computationally efficient manner. By employing a novel approach that utilizes InstanceMap and InstanceLift for generating and refining view-consistent pseudo instance masks from a sequence of posed RGB images, we circumvent the complexities associated with previous methods that only optimize a neural field. Furthermore, the introduction of InstanceLoc allows rapid localization of instances in unseen views by combining fast segmentation models and a refined neural label field. Our evaluations across Replica and ScanNet and different frontend segmentation models showcase 3DIML's speed and effectiveness. It offers a promising avenue for real-world applications requiring efficient and accurate scene analysis.

Fig. 7: Qualitative results on office3 and room1 from the Replica-vMap split [20]. Both InstanceMap and InstanceLift are able to maintain quality and consistency over the image sequence despite duplicate objects, owing to sufficient image context overlap across the sequence.

TABLE III: Quantitative (mIoU, TP) results for the GroundedSAM frontend on Replica-vMap. The average number of reference instances for all Replica scenes we evaluated on is 67.

TABLE IV: Quantitative (mIoU, TP) results for the GroundedSAM frontend on ScanNet. The average number of reference instances for all ScanNet scenes we evaluated on is 32.

TABLE V: Quantitative (mIoU, TP) results for InstanceLift and InstanceLoc on novel views over the Replica-vMap split [20].

Fig. 8: InstanceLoc is able to correct for noise rendered by InstanceLift.

Fig. 9: Our method does not perform well when the scan sequence contains only images of the different sides of an object (chair) or surface (floor) taken from differing directions, without any smooth transitions in between; this occurs for office2 from Replica (vMap split [20]).


REFERENCES

[1] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," CoRR, vol. abs/2112.01527, 2021. [Online]. Available: https://arxiv.org/abs/2112.01527

[2] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.

[3] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, "Mask3D: Mask transformer for 3D semantic instance segmentation," in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 8216–8223.

[4] A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann, "OpenMask3D: Open-vocabulary 3D instance segmentation," arXiv preprint arXiv:2306.13631, 2023.

[5] Y. Siddiqui, L. Porzi, S. R. Bulo, N. Müller, M. Nießner, A. Dai, and P. Kontschieder, "Panoptic lifting for 3D scene understanding with neural fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9043–9052.

[6] Y. Bhalgat, I. Laina, J. F. Henriques, A. Zisserman, and A. Vedaldi, "Contrastive Lift: 3D object instance segmentation by slow-fast contrastive fusion," 2023.

[7] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.

[8] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, "Fast Segment Anything," arXiv preprint arXiv:2306.12156, 2023.

[9] L. Ke, M. Ye, M. Danelljan, Y. Liu, Y.-W. Tai, C.-K. Tang, and F. Yu, "Segment anything in high quality," arXiv preprint arXiv:2306.01567, 2023.

[10] Y. Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y. Xie, Y. Qin, T. Luo, Y. Li, S. Liu, Y. Guo, and L. Zhang, "Recognize Anything: A strong image tagging model," 2023.

[11] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang, "Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection," 2023.

[12] S. Zhi, T. Laidlow, S. Leutenegger, and A. J. Davison, "In-place scene labelling and understanding with implicit scene representation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15838–15847.

[13] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, "TensoRF: Tensorial radiance fields," in European Conference on Computer Vision (ECCV), 2022.

[14] B. Hu, J. Huang, Y. Liu, Y.-W. Tai, and C.-K. Tang, "Instance neural radiance field," arXiv preprint arXiv:2304.04395, 2023.

[15] ——, "NeRF-RPN: A general framework for object detection in NeRFs," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23528–23538.

[16] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, "From coarse to fine: Robust hierarchical localization at large scale," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12716–12725.

[17] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, "LoFTR: Detector-free local feature matching with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8922–8931.

[18] M. Tancik, E. Weber, E. Ng, R. Li, B. Yi, T. Wang, A. Kristoffersen, J. Austin, K. Salahi, A. Ahuja et al., "Nerfstudio: A modular framework for neural radiance field development," in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–12.

[19] T. Müller, A. Evans, C. Schied, and A. Keller, "Instant neural graphics primitives with a multiresolution hash encoding," ACM Transactions on Graphics, vol. 41, no. 4, pp. 1–15, Jul. 2022. [Online]. Available: http://dx.doi.org/10.1145/3528223.3530127

[20] X. Kong, S. Liu, M. Taher, and A. J. Davison, "vMAP: Vectorised object mapping for neural field SLAM," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 952–961.

[21] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.


:::info Authors:

(1) George Tang, Massachusetts Institute of Technology;

(2) Krishna Murthy Jatavallabhula, Massachusetts Institute of Technology;

(3) Antonio Torralba, Massachusetts Institute of Technology.

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

[1] GroundedSAM produces lower quality frontend masks than SAM due to prompting using bounding boxes instead of a point grid.
