Introduces O3D-SIM, an Open-set 3D Semantic Instance Map that uses CLIP/DINO embeddings for robust Vision-Language Navigation

3D Semantic Instance Maps: Leveraging Foundation Models for Language-Guided Navigation

2025/12/09 11:06

Abstract and 1 Introduction

  2. Related Works

    2.1. Vision-and-Language Navigation

    2.2. Semantic Scene Understanding and Instance Segmentation

    2.3. 3D Scene Reconstruction

  3. Methodology

    3.1. Data Collection

    3.2. Open-set Semantic Information from Images

    3.3. Creating the Open-set 3D Representation

    3.4. Language-Guided Navigation

  4. Experiments

    4.1. Quantitative Evaluation

    4.2. Qualitative Results

  5. Conclusion and Future Work, Disclosure statement, and References

ABSTRACT

Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work SI Maps [1] showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline’s robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn’t be able to identify.

1. Introduction

Vision-Language Navigation (VLN) research has recently seen significant progress, enabling robots to navigate environments using natural language instructions [2, 3]. A major challenge in this field is grounding language descriptions onto real-world visual observations, especially for tasks that require spatial reasoning and the identification of specific object instances [4]. There is an increasing interest in building semantic spatial maps representing the environment and its objects [5]. Works like VLMaps [2] and NLMap [6] leverage pre-trained vision-language models to construct semantic spatial maps (SSMs) without manual labelling. However, these methods cannot differentiate between multiple instances of the same object, limiting their utility for instance-specific queries.

Our previous work [1] proposes a memory-efficient mechanism for creating a 2D semantic spatial representation of the environment with instance-level semantics directly applicable to robots navigating in real-world scenes. It was shown that Semantic Instance Maps (SI Maps) are computationally efficient to construct and allow for a broad range of complex and realistic commands that elude prior works. However, the map built in this previous work is a top-down 2D map based on a closed set, and it assumes that the Large Language Model (LLM) knows all object categories in advance in order to map language queries to a specific object class. The 2D maps, though sufficient for a good range of tasks, limit performance when larger objects obscure smaller ones.
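The occlusion problem with top-down maps can be illustrated with a toy projection. This is a minimal sketch, not the SI Maps construction: it assumes, purely for illustration, that each 2D grid cell keeps only the label of its highest point. The function name `top_down_map`, the cell size, and the sample points are all hypothetical.

```python
def top_down_map(points, cell=1.0):
    """Project labelled 3D points onto a 2D grid.

    Assumption for illustration: each cell keeps only the label of
    its highest point, as a simple top-down view would.
    """
    grid = {}
    for x, y, z, label in points:
        key = (int(x // cell), int(y // cell))
        if key not in grid or z > grid[key][0]:
            grid[key] = (z, label)
    return {key: label for key, (_, label) in grid.items()}


# Hypothetical scene: a shoe lies under a table, a chair stands elsewhere.
points = [
    (0.2, 0.3, 0.75, "table"),
    (0.4, 0.6, 0.10, "shoe"),   # shares a cell with the table, but lower
    (2.5, 0.5, 0.45, "chair"),
]

labels_2d = set(top_down_map(points).values())
labels_3d = {label for *_, label in points}
```

Under this assumption the shoe is absent from `labels_2d` but still present in `labels_3d`, which is the kind of information a 3D representation retains.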

Building upon [1], we introduce a novel approach for Semantic Instance Maps. Our work introduces Open-set 3D Semantic Instance Maps (O3D-SIM), addressing the limitation of traditional closed-set methods [7, 8], which assume that only a predefined set of objects will be encountered during operation. In contrast, O3D-SIM leverages methods to identify and potentially categorize unseen objects not explicitly included in the training data. This is crucial for real-world scenarios with diverse and unseen objects, and it enables better performance and complex query handling for unseen object categories. We address the previously mentioned limitations by enabling instance-level object identification within the spatial representation and by operating in an open-set manner. We have achieved major improvements over our previous work, which operated in a closed-set manner. The current pipeline leverages state-of-the-art foundational models like CLIP [9] and DINO [10]. These models extract semantic features from images, allowing them to recognize objects and understand the finer details and relationships between them. For example, the DINOv2 [10] model, trained on various chair images, can identify a chair in a new image and distinguish between a dining chair and an office chair.
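To make the open-set idea concrete, the sketch below matches a query embedding against per-instance embeddings by cosine similarity. It is an illustration under stated assumptions, not the pipeline described in this paper: the vectors are made-up 3-dimensional stand-ins for language-and-image-aligned embeddings (real CLIP vectors are much larger), and the names `instance_embeddings` and `best_match` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, one per mapped object instance.
instance_embeddings = {
    "instance_0": [0.9, 0.1, 0.0],
    "instance_1": [0.1, 0.9, 0.1],
    "instance_2": [0.0, 0.2, 0.9],
}


def best_match(query_embedding, instances):
    """Return (instance_id, score) for the instance most similar to the query."""
    return max(
        ((name, cosine(query_embedding, emb)) for name, emb in instances.items()),
        key=lambda pair: pair[1],
    )


# Stand-in for the text embedding of a free-form query such as "mannequin".
query = [0.05, 0.25, 0.95]
match_id, score = best_match(query, instance_embeddings)
```

Because the match depends only on similarity in the shared embedding space, no fixed list of class labels is needed, which is what distinguishes this style of lookup from a closed-set classifier.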

This paper details the proposed pipeline for creating the 3D map and evaluates its effectiveness in VLN tasks for both simulation and real-world data. Experimental evaluations demonstrate the improvements achieved through our open-set approach, with an increase in correctly identified object instances, bringing our results closer to ground truth values even in challenging real-world data. Additionally, our pipeline achieves a higher success rate for complex language navigation queries targeting specific object instances, including those unseen during training, such as mannequins or bottles.

The contributions of our work can be summarized as follows:

(1) Extending the closed-set 2D instance-level approach from our previous work [1] to an open-set 3D Semantic Instance Map using state-of-the-art image segmentation, open-set image-language-aligned embeddings, and hierarchical clustering in 3D. The resulting map consists of a 3D point cloud with instance-level embeddings, enhancing the semantic understanding.

(2) Validating the effectiveness of the 3D map approach to VLN tasks through both qualitative and quantitative experiments.
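Contribution (1) describes a map made of 3D points with one embedding per object instance. The sketch below shows one way such a structure could look. It is not the authors' method: in place of the hierarchical clustering named above, it uses a simpler greedy association that merges a detection into an existing instance when the two are close in space and similar in feature. The `Instance` class, `build_map`, and the thresholds `max_dist` and `min_sim` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    points: list     # list of (x, y, z) tuples belonging to this instance
    embedding: list  # instance-level feature vector

    def centroid(self):
        n = len(self.points)
        return tuple(sum(p[i] for p in self.points) / n for i in range(3))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merge(a, b):
    """Fuse two detections: join points, average embeddings by point count."""
    na, nb = len(a.points), len(b.points)
    emb = [(na * x + nb * y) / (na + nb) for x, y in zip(a.embedding, b.embedding)]
    return Instance(a.points + b.points, emb)


def build_map(detections, max_dist=0.5, min_sim=0.8):
    """Greedy association (an assumed stand-in for hierarchical clustering)."""
    instances = []
    for det in detections:
        for i, inst in enumerate(instances):
            close = math.dist(det.centroid(), inst.centroid()) <= max_dist
            similar = cosine(det.embedding, inst.embedding) >= min_sim
            if close and similar:
                instances[i] = merge(inst, det)
                break
        else:
            # No existing instance matched, so start a new one.
            instances.append(det)
    return instances


# Hypothetical detections: one chair seen in two frames, and a second chair
# with the same appearance located three metres away.
detections = [
    Instance([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)], [1.0, 0.0]),
    Instance([(0.2, 0.0, 0.0)], [0.9, 0.1]),
    Instance([(3.0, 0.0, 0.0)], [1.0, 0.0]),
]
semantic_map = build_map(detections)
```

In this toy run the first two detections fuse into one instance while the distant chair stays separate, even though its embedding is identical. Keeping look-alike objects apart in this way is the property that instance-specific queries rely on.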

The remainder of this paper is organized as follows. Section 2 reviews and analyzes recent literature, providing the necessary background on semantic scene understanding, 3D scene reconstruction, and VLN to the reader. Section 3 outlines the methodology behind our proposed 3D map. Section 4 discusses the experimental evaluation of our proposed 3D map’s effectiveness. Finally, concluding remarks and directions for future work are presented in Section 5.

Figure 1. We carry out complex instance-specific goal navigation in object-rich environments. These language queries refer to individual instances based on their spatial and viewpoint configuration relative to other objects of the same type, while preserving the navigation performance on standard language queries.


:::info This paper is available on arXiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.

:::


:::info Authors:

(1) Laksh Nanwani, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;

(2) Kumaraditya Gupta, International Institute of Information Technology, Hyderabad, India;

(3) Aditya Mathur, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;

(4) Swayam Agrawal, International Institute of Information Technology, Hyderabad, India;

(5) A.H. Abdul Hafez, Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey;

(6) K. Madhava Krishna, International Institute of Information Technology, Hyderabad, India.

:::


