Open‑YOLO 3D replaces costly SAM/CLIP steps with 2D detection, LG label‑maps, and parallelized visibility, enabling fast and accurate 3D OV segmentation.

Drop the Heavyweights: YOLO‑Based 3D Segmentation Outpaces SAM/CLIP

2025/08/26 16:20

Abstract and 1 Introduction

  1. Related works
  2. Preliminaries
  3. Method: Open-YOLO 3D
  4. Experiments
  5. Conclusion and References

A. Appendix

3 Preliminaries

Problem formulation: 3D instance segmentation aims to segment individual objects within a 3D scene and assign a class label to each segmented object. In the open-vocabulary (OV) setting, the class label may belong either to classes seen in the training set or to novel, previously unseen classes. To this end, let P denote a reconstructed 3D point-cloud scene, obtained from a sequence of RGB-D images. We denote the RGB image frames as I and their corresponding depth frames as D. Similar to recent methods [35, 42, 34], we assume that the camera poses and parameters are available for the input 3D scene.
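
To make the notation concrete, the following minimal sketch bundles the inputs into one structure; all names, shapes, and types here are our own illustrative assumptions, not the paper's released code:

```python
# Illustrative container for the inputs defined above; names, shapes,
# and types are assumptions for this sketch, not the authors' code.
from dataclasses import dataclass
import torch

@dataclass
class PosedRGBDScene:
    points: torch.Tensor      # P: reconstructed point cloud, shape (N, 3)
    rgb: torch.Tensor         # I: RGB frames, shape (F, H, W, 3)
    depth: torch.Tensor       # D: depth frames, shape (F, H, W)
    intrinsics: torch.Tensor  # per-frame camera intrinsics, shape (F, 3, 3)
    poses: torch.Tensor       # world-to-camera extrinsics, shape (F, 4, 4)
```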


3.1 Baseline Open-Vocabulary 3D Instance Segmentation

We base our approach on OpenMask3D [42], which is the first method that performs open-vocabulary 3D instance segmentation in a zero-shot manner. OpenMask3D has two main modules: a class-agnostic mask proposal head, and a mask-feature computation module. The class-agnostic mask proposal head uses a transformer-based pre-trained 3D instance segmentation model [39] to predict a binary mask for each object in the point cloud. The mask-feature computation module first generates 2D segmentation masks by projecting 3D masks into views in which the 3D instances are highly visible, and refines them using the SAM [23] model. A pre-trained CLIP vision-language model [55] is then used to generate image embeddings for the 2D segmentation masks. The embeddings are then aggregated across all the 2D frames to generate a 3D mask-feature representation.
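
For orientation, the two modules can be summarized in the skeleton below. The pre-trained components are injected as callables (`propose` standing in for the 3D proposal network [39], `refine` for SAM [23], `embed` for CLIP [55]); every function name and signature here is a hypothetical stand-in, not a real API:

```python
import torch

def openmask3d_features(scene, propose, top_views, project, refine, embed, k=5):
    """Skeleton of the two OpenMask3D modules described above. Every
    callable is a hypothetical stand-in for a pre-trained component:
    propose ~ the 3D proposal network [39], refine ~ SAM [23],
    embed ~ CLIP [55]. Signatures are illustrative, not real APIs."""
    masks3d = propose(scene.points)               # class-agnostic 3D masks
    mask_feats = []
    for mask in masks3d:
        feats = []
        for v in top_views(mask, scene, k):       # views with high visibility
            mask2d = project(mask, scene, v)      # 3D mask -> 2D projection
            mask2d = refine(mask2d, scene.rgb[v])          # SAM refinement
            feats.append(embed(scene.rgb[v], mask2d))      # CLIP image embedding
        mask_feats.append(torch.stack(feats).mean(dim=0))  # aggregate across views
    return masks3d, torch.stack(mask_feats)       # 3D mask-feature representation
```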

Limitations: OpenMask3D leverages advances in 2D segmentation (SAM) and vision-language models (CLIP) to generate and aggregate 2D feature representations, enabling instances to be queried with open-vocabulary concepts. However, this approach incurs a heavy computational burden, leading to slow inference of 5-10 minutes per scene. The burden mainly originates from two sub-tasks: 2D segmentation of the large number of objects across the various 2D views, and 3D feature aggregation based on object visibility. We next introduce our proposed method, which reduces this computational burden while improving task accuracy.


4 Method: Open-YOLO 3D

Motivation: Here we present our proposed 3D open-vocabulary instance segmentation method, Open-YOLO 3D, which is designed to generate 3D instance predictions efficiently. Our method introduces efficient and improved modules at both the task level and the data level.

Task level: Unlike OpenMask3D, which generates segmentations of the projected 3D masks, we pursue a more efficient approach that relies on 2D object detection. Since the end goal is to assign labels to the 3D masks, the additional computation of a full 2D segmentation is unnecessary.

Data level: OpenMask3D computes 3D mask visibility in 2D frames by iteratively counting visible points for each mask across all frames. This is time-consuming, so we propose an alternative that computes the visibility of all 3D masks within all frames at once.


4.1 Overall Architecture


4.2 3D Object Proposal


4.3 Low Granularity (LG) Label-Maps
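
The full details of this section are not reproduced in this excerpt. As a rough illustration of the idea named in the paper's summary, box-level detections standing in for pixel-accurate masks, the sketch below paints each detection's bounding box with its prompt ID to form a low-granularity label map; the overlap rule (higher confidence wins) is purely our assumption:

```python
import torch

def lg_label_map(boxes, class_ids, scores, H, W, ignore_id=-1):
    """Hypothetical sketch: paint each 2D detection's bounding box with
    its prompt ID to form a low-granularity label map. Lower-confidence
    boxes are painted first so higher-confidence ones overwrite them on
    overlap (this overlap rule is our assumption, not the paper's).

    boxes:     (B, 4) tensor of (x1, y1, x2, y2) pixel coordinates
    class_ids: (B,) long tensor of text-prompt IDs
    scores:    (B,) detection confidences
    """
    label_map = torch.full((H, W), ignore_id, dtype=torch.long)
    for i in torch.argsort(scores):            # ascending confidence
        x1, y1, x2, y2 = boxes[i].long().clamp(min=0).tolist()
        label_map[y1:y2, x1:x2] = class_ids[i]
    return label_map
```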


4.4 Accelerated Visibility Computation (VAcc)

In order to associate 2D label maps with 3D proposals, we compute the visibility of each 3D mask. To this end, we propose a fast approach that computes the visibility of every 3D mask in every frame via highly parallelizable tensor operations.
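
A minimal sketch of how such a batched computation might look, assuming points of shape (N, 3), boolean masks of shape (K, N), per-frame intrinsics (F, 3, 3), and world-to-camera extrinsics (F, 4, 4). Depth-based occlusion testing against D is omitted for brevity, and none of this is the authors' exact implementation:

```python
import torch

def visibility_matrix(points, masks, intrinsics, w2c, H, W):
    """Visible-point fraction of every 3D mask in every frame, computed
    at once with batched tensor ops (a sketch, not the paper's code).

    points:     (N, 3) point cloud P
    masks:      (K, N) boolean instance masks
    intrinsics: (F, 3, 3) per-frame camera intrinsics
    w2c:        (F, 4, 4) world-to-camera extrinsics
    Returns:    (K, F) visibility fractions
    """
    N = points.shape[0]
    ones = torch.ones(N, 1, dtype=points.dtype, device=points.device)
    pts_h = torch.cat([points, ones], dim=1)                  # (N, 4)
    # One batched transform into all F camera frames: (F, N, 3)
    cam = torch.einsum('fij,nj->fni', w2c[:, :3, :], pts_h)
    z = cam[..., 2].clamp(min=1e-6)
    # Perspective projection with per-frame intrinsics: (F, N, 2)
    uv = torch.einsum('fij,fnj->fni', intrinsics, cam / z.unsqueeze(-1))[..., :2]
    # In-image test for every (frame, point) pair at once: (F, N).
    # A real implementation would also compare projected depth against
    # the depth maps D to handle occlusion; omitted here for brevity.
    inside = ((cam[..., 2] > 0)
              & (uv[..., 0] >= 0) & (uv[..., 0] < W)
              & (uv[..., 1] >= 0) & (uv[..., 1] < H))
    # Visible-point counts for all (mask, frame) pairs via one matmul
    counts = masks.float() @ inside.float().T                 # (K, F)
    return counts / masks.float().sum(dim=1, keepdim=True).clamp(min=1)
```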

Figure 3: Multi-View Prompt Distribution (MVPDist). After creating the LG label maps for all frames, we select the top-k label maps based on the 2D projection of the 3D proposal. Using the (x, y) coordinates of the 2D projection, we gather labels from the LG label maps to form the MVPDist. The text prompt ID with the highest probability under this distribution is taken as the prediction.
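
Following the caption, here is a sketch of how the MVPDist might be assembled for one 3D proposal, given its projected pixel coordinates in the selected top-k frames; the vote counting and normalization details are our assumptions:

```python
import torch

def mvpdist(label_maps, uv, topk_frames, num_prompts, ignore_id=-1):
    """Sketch of MVPDist for one 3D proposal, following the caption;
    vote counting and normalization details are our assumptions.

    label_maps:  (F, H, W) long tensor of LG label maps
    uv:          (F, N, 2) projected (x, y) pixels of the proposal's points
    topk_frames: indices of the top-k frames where the mask is most visible
    """
    F_, H, W = label_maps.shape
    votes = torch.zeros(num_prompts)
    for f in topk_frames:
        x = uv[f, :, 0].long().clamp(0, W - 1)
        y = uv[f, :, 1].long().clamp(0, H - 1)
        labels = label_maps[f, y, x]             # labels under the projection
        labels = labels[labels != ignore_id]     # discard unlabeled pixels
        votes += torch.bincount(labels, minlength=num_prompts).float()
    dist = votes / votes.sum().clamp(min=1)      # multi-view prompt distribution
    return dist.argmax().item(), dist            # highest-probability prompt ID
```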


4.5 Multi-View Prompt Distribution (MVPDist)

Table 1: State-of-the-art comparison on the ScanNet200 validation set. We use Mask3D trained on the ScanNet200 training set to generate class-agnostic mask proposals. Our method outperforms approaches that generate 3D proposals by fusing 2D masks with proposals from a 3D network (highlighted in gray in the table), and it surpasses state-of-the-art methods by a wide margin under the same conditions, i.e., using only proposals from a 3D network.


4.6 Instance Prediction Confidence Score


:::info Authors:

(1) Mohamed El Amine Boudjoghra, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) ([email protected]);

(2) Angela Dai, Technical University of Munich (TUM) ([email protected]);

(3) Jean Lahoud, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) ([email protected]);

(4) Hisham Cholakkal, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) ([email protected]);

(5) Rao Muhammad Anwer, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Aalto University ([email protected]);

(6) Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University ([email protected]);

(7) Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Australian National University ([email protected]).

:::


:::info This paper is available on arxiv under CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::

