Introduces O3D-SIM, an Open-set 3D Semantic Instance Map that uses CLIP/DINO embeddings for robust Vision-Language Navigation.

3D Semantic Instance Maps: Leveraging Foundation Models for Language-Guided Navigation

2025/12/09 11:06

Abstract and 1 Introduction

  2. Related Works

    2.1. Vision-and-Language Navigation

    2.2. Semantic Scene Understanding and Instance Segmentation

    2.3. 3D Scene Reconstruction

  3. Methodology

    3.1. Data Collection

    3.2. Open-set Semantic Information from Images

    3.3. Creating the Open-set 3D Representation

    3.4. Language-Guided Navigation

  4. Experiments

    4.1. Quantitative Evaluation

    4.2. Qualitative Results

  5. Conclusion and Future Work, Disclosure statement, and References

ABSTRACT

Humans excel at forming mental maps of their surroundings, which lets them understand object relationships and navigate based on language queries. Our previous work, SI Maps [1], showed that combining instance-level information with a semantic understanding of the environment significantly improves performance on language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline’s robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which provide the semantic understanding that natural language commands can query. Quantitatively, the work improves the success rate of language-guided tasks. Qualitatively, we observe that instances are identified more clearly, and that the foundational models and their language-image-aligned embeddings identify objects that a closed-set approach could not.

1. Introduction

Vision-Language Navigation (VLN) research has recently seen significant progress, enabling robots to navigate environments using natural language instructions [2, 3]. A major challenge in this field is grounding language descriptions in real-world visual observations, especially for tasks that require identifying specific object instances and reasoning spatially [4]. There is increasing interest in building semantic spatial maps that represent the environment and its objects [5]. Works like VLMaps [2] and NLMap [6] leverage pre-trained vision-language models to construct semantic spatial maps (SSMs) without manual labelling. However, these methods cannot differentiate between multiple instances of the same object, limiting their utility for instance-specific queries.

Our previous work [1] proposes a memory-efficient mechanism for creating a 2D semantic spatial representation of the environment with instance-level semantics that is directly applicable to robots navigating in real-world scenes. It showed that Semantic Instance Maps (SI Maps) are computationally efficient to construct and allow for a broad range of complex and realistic commands that elude prior works. However, the map built in this previous work is a top-down 2D map based on a closed set of object categories, and it assumes that the Large Language Model (LLM) knows all object categories in advance in order to map language queries to a specific object class. The 2D maps, though sufficient for a good range of tasks, limit performance when larger objects obscure smaller ones.

Building upon [1], we introduce Open-set 3D Semantic Instance Maps (O3D-SIM), a novel approach to Semantic Instance Maps that addresses the limitation of traditional closed-set methods [7, 8], which assume that only a predefined set of objects will be encountered during operation. In contrast, O3D-SIM leverages methods that identify and potentially categorize unseen objects not explicitly included in the training data. This is crucial for real-world scenarios with diverse and unseen objects, and it enables better performance and complex query handling for unseen object categories. We address the previously mentioned limitations by enabling instance-level object identification within the spatial representation and by operating in an open-set manner. This yields major improvements over our previous work, which operated in a closed-set manner. The current pipeline leverages state-of-the-art foundational models like CLIP [9] and DINO [10]. These models extract semantic features from images, allowing them to recognize objects and understand the finer details and relationships between them. For example, the DINOv2 [10] model, trained on various chair images, can identify a chair in a new image and distinguish between a dining chair and an office chair.
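The difference between closed-set and open-set lookup can be illustrated with a minimal sketch. This is not the authors' implementation: the label vocabulary, the function names, the similarity threshold, and the three-dimensional toy vectors are all hypothetical stand-ins. In a real pipeline the vectors would come from an image-language-aligned encoder such as CLIP and would have hundreds of dimensions.

```python
import math

# Hypothetical closed-set vocabulary: only these labels can ever be matched.
CLOSED_SET_LABELS = {"chair", "table", "sofa"}


def closed_set_lookup(query: str):
    """Return the query label if it is in the fixed vocabulary, else None."""
    return query if query in CLOSED_SET_LABELS else None


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def open_set_lookup(query_embedding, object_embeddings, threshold=0.5):
    """Return the name of the most similar stored object, or None if no
    object exceeds the (assumed) similarity threshold."""
    best_name, best_score = None, threshold
    for name, embedding in object_embeddings.items():
        score = cosine(query_embedding, embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Toy embeddings standing in for encoder outputs (made-up values).
OBJECT_EMBEDDINGS = {
    "instance_0": [0.9, 0.1, 0.0],
    "instance_1": [0.1, 0.9, 0.1],
    "instance_2": [0.0, 0.2, 0.95],
}
# Stand-in for the text embedding of a query such as "mannequin".
QUERY_EMBEDDING = [0.05, 0.15, 0.9]

print(closed_set_lookup("mannequin"))
print(open_set_lookup(QUERY_EMBEDDING, OBJECT_EMBEDDINGS))
```

The closed-set lookup has no answer for a label outside its vocabulary, while the embedding comparison can still return the closest stored object because it never consults a fixed label list.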

This paper details the proposed pipeline for creating the 3D map and evaluates its effectiveness in VLN tasks for both simulation and real-world data. Experimental evaluations demonstrate the improvements achieved through our open-set approach, with an increase in correctly identified object instances, bringing our results closer to ground truth values even in challenging real-world data. Additionally, our pipeline achieves a higher success rate for complex language navigation queries targeting specific object instances, including those unseen during training, such as mannequins or bottles.

The contributions of our work can be summarized as follows:

(1) Extending the closed-set 2D instance-level approach from our previous work [1] to an open-set 3D Semantic Instance Map using state-of-the-art image segmentation, open-set image-language-aligned embeddings, and hierarchical clustering in 3D. The resulting map consists of a 3D point cloud with instance-level embeddings, enhancing the semantic understanding.

(2) Validating the effectiveness of the 3D map approach for VLN tasks through both qualitative and quantitative experiments.
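To make the map described in contribution (1) more concrete, here is a minimal sketch of a point cloud stored per instance, each with its own embedding. The `Instance` class, its field names, and the `nearest_instance` helper are hypothetical and chosen only for illustration; the paper's actual data layout is described in its methodology section. The sketch shows why instance-level storage matters: two chairs remain separate entries with their own points, so a spatial query can single out one of them.

```python
from dataclasses import dataclass
import math

Point = tuple[float, float, float]


@dataclass
class Instance:
    """One object instance in the map (hypothetical layout)."""
    instance_id: int
    points: list[Point]      # 3D points belonging to this instance
    embedding: list[float]   # instance-level feature vector (toy values here)

    def centroid(self) -> Point:
        """Mean of the instance's points, usable as a rough navigation goal."""
        n = len(self.points)
        return (
            sum(p[0] for p in self.points) / n,
            sum(p[1] for p in self.points) / n,
            sum(p[2] for p in self.points) / n,
        )


def nearest_instance(candidates: list[Instance], reference: Point) -> Instance:
    """Pick the candidate whose centroid is closest to a reference point."""
    return min(candidates, key=lambda inst: math.dist(inst.centroid(), reference))


# Two chairs with similar embeddings but different locations, plus a table.
chair_a = Instance(0, [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [0.9, 0.1])
chair_b = Instance(1, [(8.0, 0.0, 0.0), (10.0, 0.0, 0.0)], [0.88, 0.12])
table = Instance(2, [(9.0, 1.0, 0.0), (9.0, 3.0, 0.0)], [0.1, 0.9])

# A query like "the chair next to the table" reduces, in this toy setting,
# to choosing the chair instance nearest the table's centroid.
target = nearest_instance([chair_a, chair_b], table.centroid())
print(target.instance_id, target.centroid())
```

A map that stored only a per-category label could answer "where are chairs?" but not "which chair?", which is the distinction the instance-level representation is meant to capture.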

The remainder of this paper is organized as follows. Section 2 reviews recent literature, providing the necessary background on semantic scene understanding, 3D scene reconstruction, and VLN. Section 3 outlines the methodology behind our proposed 3D map. Section 4 discusses the experimental evaluation of the map’s effectiveness. Finally, Section 5 presents concluding remarks and directions for future work.

Figure 1. We carry out complex instance-specific goal navigation in object-rich environments. These language queries refer to individual instances based on their spatial and viewpoint configuration relative to other objects of the same type, while preserving navigation performance on standard language queries.

:::info This paper is available on arXiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.

:::

:::info Authors:

(1) Laksh Nanwani, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;

(2) Kumaraditya Gupta, International Institute of Information Technology, Hyderabad, India;

(3) Aditya Mathur, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;

(4) Swayam Agrawal, International Institute of Information Technology, Hyderabad, India;

(5) A.H. Abdul Hafez, Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey;

(6) K. Madhava Krishna, International Institute of Information Technology, Hyderabad, India.

:::
