This article introduces OW‑VISCap, a unified framework for open‑world video instance segmentation and object‑centric captioning.

See, Track, Describe: How OW‑VISCap Lets AI Tell the Story Behind Every Frame

By: Hackernoon
2025/11/04 17:11

:::info Authors:

(1) Anwesa Choudhuri, University of Illinois at Urbana-Champaign ([email protected]);

(2) Girish Chowdhary, University of Illinois at Urbana-Champaign ([email protected]);

(3) Alexander G. Schwing, University of Illinois at Urbana-Champaign ([email protected]).

:::

Abstract and 1. Introduction

  2. Related Work

    2.1 Open-world Video Instance Segmentation

    2.2 Dense Video Object Captioning and 2.3 Contrastive Loss for Object Queries

    2.4 Generalized Video Understanding and 2.5 Closed-World Video Instance Segmentation

  3. Approach

    3.1 Overview

    3.2 Open-World Object Queries

    3.3 Captioning Head

    3.4 Inter-Query Contrastive Loss and 3.5 Training

  4. Experiments and 4.1 Datasets and Evaluation Metrics

    4.2 Main Results

    4.3 Ablation Studies and 4.4 Qualitative Results

  5. Conclusion, Acknowledgements, and References

Supplementary Material

A. Additional Analysis

B. Implementation Details

C. Limitations

Abstract. Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require additional user input, or use classic region-based proposals to identify never-before-seen objects. Further, these methods only assign a one-word label to detected objects and don’t generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never-before-seen objects without additional user input. We generate rich and descriptive object-centric captions for each detected object via a masked-attention-augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses the state of the art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset.


1 Introduction

Open-world video instance segmentation (OW-VIS) involves detecting, segmenting and tracking previously seen or unseen objects in a video. This task is challenging because the objects are often never seen during training, are occasionally partly or entirely occluded, change in appearance and position over time, and may leave the scene only to reappear later. Addressing these challenges to obtain an accurate OW-VIS method that works online is crucial in fields such as autonomous systems and augmented and virtual reality.

Some recent methods based on abstract object queries perform remarkably well for closed-world video instance segmentation [7, 13, 18, 50]. These works assume a fixed set of object categories during training and evaluation. However, it is unrealistic to assume that all object categories are seen during training. For example, in Fig. 1, the trailer truck (top row), highlighted in yellow, and the lawn mower (bottom row), highlighted in green, were never seen during training.

Fig. 1: OW-VISCap is able to simultaneously detect, track and caption objects in the given video frames. The first example (top row) shows a road scene with a previously unseen trailer truck and cars which are seen during training. The second example (bottom row) shows a person on a lawn mower, and a dog on the grass. The lawn mower isn’t part of the training set. We generate meaningful object-centric captions even for objects never seen during training. The captions for unseen objects are underlined.

For this reason, open-world video instance segmentation (OW-VIS) has been proposed [2, 10, 27, 28, 39, 44]. Current works on OW-VIS suffer from the following three main issues.

Firstly, they often require a prompt, i.e., additional input from the user, the ground truth, or another network. The prompts can be in the form of points, bounding boxes or text. These methods only work when the additional inputs are available, making them less practical in the real world. Prompt-less OW-VIS methods [2, 10, 27, 28, 39, 44] sometimes rely on classic region-based object proposals [2, 27, 28, 44], or only operate on one kind of object query for both the open- and the closed-world [10, 39], which may lead to sub-optimal results (shown later in Tab. 4).

Secondly, all methods on video instance segmentation, closed- or open-world, assign a one-word label to the detected objects. However, a one-word label is often not sufficient to describe an object. The ability to generate rich object-centric descriptions is important, especially in the open-world setting. DVOC-DS [58] jointly addresses the task of closed-world object detection and object-centric captioning in videos. However, it is not clear how DVOC-DS [58] can be extended to an open-world setting. Besides, DVOC-DS [58] uses only the features from the individual object trajectories for object-centric captioning, so the overall context from the entire video frames may be lost. DVOC-DS [58] also struggles with very long videos, and cannot caption multiple action segments within a single object trajectory because it produces a single caption for the entire object trajectory.

Thirdly, some of the aforementioned works [7, 8, 13, 18] suffer from multiple similar object queries, resulting in repetitive predictions. Non-maximum suppression or other post-processing techniques may be necessary to suppress the repetitions and highly overlapping false positives.

We address the three aforementioned issues through our Open-World Video Instance Segmentation and Captioning (OW-VISCap) approach: it simultaneously detects and segments objects in a video and generates object-centric captions for them. Fig. 1 shows two examples in which our method successfully detects, segments and captions both closed- and open-world objects.

To address the first issue, our OW-VISCap combines the advantages of both prompt-based and prompt-less methods. We introduce open-world object queries, in addition to the closed-world object queries used in prior work [8]. This encourages discovery of never-before-seen open-world objects without compromising closed-world performance much. Notably, we do not require additional prompts from the ground truth or separate networks. Instead, we use equally spaced points distributed across the video frames as prompts and encode them to form open-world object queries, which enables discovery of new objects. The equally spaced points incorporate information from different spatial regions of the given video frames. We also introduce a specifically tailored open-world loss to train the open-world object queries to discover new objects.
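As a rough illustration of the "equally spaced points" idea, the following minimal sketch generates a regular grid of point prompts over a frame. The function name `grid_points`, the cell-centre placement, and the grid size are assumptions made for illustration; this section does not state how many points are used or how the prompt encoder consumes them.

```python
def grid_points(height, width, points_per_side):
    """Return (x, y) coordinates of an evenly spaced grid over a frame.

    Hypothetical helper: each point sits at the centre of one of
    points_per_side x points_per_side equal cells, so no point lies
    exactly on the frame border.
    """
    step_y = height / points_per_side
    step_x = width / points_per_side
    return [
        ((col + 0.5) * step_x, (row + 0.5) * step_y)
        for row in range(points_per_side)
        for col in range(points_per_side)
    ]


# A 4 x 4 grid over a 640 x 480 frame (sizes chosen only for the example).
points = grid_points(height=480, width=640, points_per_side=4)
print(len(points))  # 16
print(points[0])    # (80.0, 60.0)
```

In a full system each of these points would be passed through a prompt encoder to produce one open-world object query; that encoder is not sketched here.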

To address the second issue, OW-VISCap includes a captioning head to produce an object-centric caption for each object query, both open- and closed-world. We use masked cross attention in an object-to-text transformer in the captioning head to generate object-centric text queries, which are then used by a frozen large language model (LLM) to produce an object-centric caption. Note that masked attention has been used for closed-world object segmentation [7, 8]. However, to the best of our knowledge, it has not been used for object captioning before. The masked cross attention helps focus on the local object features, whereas the self attention in the object-to-text transformer incorporates overall context by looking at the video-frame features. Moreover, unlike DVOC-DS [58], we are able to handle long videos and multiple action segments within a single object trajectory because we process short video clips sequentially and combine the clips using CAROQ [13].
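To make the role of the mask concrete, here is a minimal single-query, single-head sketch of masked cross attention in plain Python. It is not the paper's object-to-text transformer: there are no learned projections, and the name `masked_cross_attention` and the toy inputs are assumptions. It only shows how restricting attention to an object's mask keeps the output focused on local object features.

```python
import math


def masked_cross_attention(query, keys, values, mask):
    """Attend from one query to the feature locations where mask is True.

    query:  list of d floats
    keys:   list of N key vectors (d floats each), one per feature location
    values: list of N value vectors
    mask:   list of N booleans, True inside the object's predicted mask
    """
    if not any(mask):
        raise ValueError("mask must keep at least one location")
    d = len(query)
    # Scaled dot-product scores; masked-out locations get -inf.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the value vectors.
    return [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]


attended = masked_cross_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]],
    mask=[True, False, True],  # the second location lies outside the object
)
print(attended)  # [7.5, 2.5]
```

The second location contributes nothing to the result because its score is forced to negative infinity before the softmax, which is the usual way such masks are applied.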

To address the third issue, we introduce an inter-query contrastive loss for both open- and closed-world object queries. It encourages the object queries to differ from one another. This prevents repetitive predictions and encourages novel object discovery in the open-world. Note that this contrastive loss also helps in closed-world video instance segmentation by automatically encouraging non-maximum suppression, and by removing highly overlapping false positive predictions.
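The exact form of the inter-query contrastive loss is not given in this section. As a stand-in, the sketch below uses a simple repulsion term, the mean positive cosine similarity over all pairs of queries, to show the general idea of penalising queries that resemble one another. The name `query_repulsion_loss` and the choice of cosine similarity are assumptions, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def query_repulsion_loss(queries):
    """Mean positive cosine similarity over all distinct query pairs.

    Illustrative stand-in only: the value is 0 when no two queries point
    in a similar direction and grows as queries become more alike.
    """
    if len(queries) < 2:
        raise ValueError("need at least two queries")
    pairs = [
        (i, j)
        for i in range(len(queries))
        for j in range(i + 1, len(queries))
    ]
    return sum(
        max(0.0, cosine(queries[i], queries[j])) for i, j in pairs
    ) / len(pairs)


distinct_loss = query_repulsion_loss([[1.0, 0.0], [0.0, 1.0]])
duplicate_loss = query_repulsion_loss([[1.0, 0.0], [1.0, 0.0]])
print(distinct_loss, duplicate_loss)  # 0.0 1.0
```

Minimising a term like this pushes duplicated queries apart, which is the intuition behind the reduction in repetitive, highly overlapping predictions described above.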

To demonstrate the efficacy of our OW-VISCap on open-world video instance segmentation and captioning, we evaluate this approach on three diverse and challenging tasks: open-world video instance segmentation (OW-VIS), dense video object captioning (Dense VOC), and closed-world video instance segmentation (VIS). We achieve a performance improvement of ∼6% on the previously unseen (uncommon) categories in the BURST [2] dataset for OW-VIS, and a ∼7% improvement in captioning accuracy for detected objects on the VidSTG [57] dataset for the Dense VOC task, while performing on par with the state of the art on the closed-world VIS task on the OVIS dataset (our AP score is 25.4, compared to 25.8 for a recent VIS SOTA, CAROQ [13]).

Fig. 2: The left figure shows an overview of our OW-VISCap (Sec. 3.1). We introduce open-world object queries q_ow (Sec. 3.2) and a captioning head (Sec. 3.3). The open-world object queries are generated by encoding a grid of points along the image-feature dimensions via a prompt encoder (shown in purple). The right figure details the captioning head (Sec. 3.3) for object-centric captioning. We use masked attention in the object-to-text transformer of the captioning head.


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::

