Proves Q-Former is a Multi-Head MIL module due to permutation invariance in its cross-attention. Notes its limitation: it assumes i.i.d. instances, overlooking crucial instance correlation.

MIL Perspective: Analyzing Q-Former as a Multi-Head Mechanism

2025/11/14 10:52

Abstract and 1. Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

3.2. Relations between Attention-based VPG and MIL

In AB-MIL [16], the attention weights are computed as in Equation 5.
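For orientation, attention-based MIL pooling is commonly written in the form below. The symbols are assumptions taken from the usual presentation of that method and may differ from the notation of Equation 5: $\mathbf{h}_k$ is the embedding of the $k$-th instance in a bag of $K$ instances, $\mathbf{w}$ and $\mathbf{V}$ are learnable parameters, $a_k$ is the weight of instance $k$, and $\mathbf{z}$ is the pooled bag embedding.

```latex
a_k = \frac{\exp\left\{ \mathbf{w}^{\top} \tanh\left( \mathbf{V} \mathbf{h}_k^{\top} \right) \right\}}
           {\sum_{j=1}^{K} \exp\left\{ \mathbf{w}^{\top} \tanh\left( \mathbf{V} \mathbf{h}_j^{\top} \right) \right\}},
\qquad
\mathbf{z} = \sum_{k=1}^{K} a_k \mathbf{h}_k
```

Because the weights are normalized over the whole bag and the sum runs over all instances, reordering the instances leaves $\mathbf{z}$ unchanged.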


Proposition 1. QFormer belongs to the category of Multiple Instance Learning modules.

Within the cross-attention layer of QFormer, every query token computes weights for the image embeddings. The query embeddings, being learnable parameters, can be seen as a linear transformation from an instance to its weight. More precisely, each row of the attention map A holds the weights assigned to the instances for aggregation. Because this weighted aggregation does not depend on the order of the instances, the cross-attention between the learnable query embeddings and the input is permutation invariant.
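The invariance claim can be checked numerically. The following is a minimal NumPy sketch, not the QFormer implementation: it assumes a single attention head, identity key/value projections, and arbitrary sizes, and the function name `cross_attention_pool` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pool(queries, instances):
    """Pool a bag of instances with learnable queries.

    queries:   (M, d) query embeddings (assumed learnable parameters)
    instances: (N, d) instance embeddings forming one bag
    Returns the pooled embeddings (M, d) and the attention map A (M, N).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ instances.T / np.sqrt(d), axis=-1)
    return attn @ instances, attn

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))    # 4 query tokens (illustrative size)
instances = rng.normal(size=(6, 8))  # 6 image embeddings in the bag

pooled, attn = cross_attention_pool(queries, instances)

# Shuffle the instances and pool again.
perm = rng.permutation(len(instances))
pooled_perm, attn_perm = cross_attention_pool(queries, instances[perm])

# Each row of A is one set of aggregation weights over the bag.
print(np.allclose(attn.sum(axis=1), 1.0))
# The pooled output does not depend on instance order.
print(np.allclose(pooled, pooled_perm))
```

Each row of `attn` plays the role of one AB-MIL-style weight vector, so a block with M queries yields M pooled views of the same bag.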

The result of cross-attention is combined with the original query embeddings through a residual connection. This step can be written in the form of Equation 6 by replacing pool with Equation 1 and setting λ = γ = I, as shown in Equation 7, which satisfies permutation equivalence.
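As a rough illustration of the residual step, the sketch below assumes the update has the form "queries plus attention-pooled instances" with identity projections and no normalization layers. This is an assumption made for illustration, not a transcription of Equations 6 and 7, and the name `residual_cross_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_cross_attention(queries, instances):
    """Assumed residual form: queries + attention-weighted pooling of instances."""
    d = queries.shape[-1]
    attn = softmax(queries @ instances.T / np.sqrt(d), axis=-1)
    return queries + attn @ instances

rng = np.random.default_rng(1)
queries = rng.normal(size=(3, 5))    # 3 query tokens (illustrative size)
instances = rng.normal(size=(7, 5))  # 7 instances in the bag

out = residual_cross_attention(queries, instances)

# Reordering the queries reorders the output rows in the same way.
q_perm = rng.permutation(len(queries))
out_q = residual_cross_attention(queries[q_perm], instances)
print(np.allclose(out_q, out[q_perm]))

# Reordering the instances leaves the output unchanged.
i_perm = rng.permutation(len(instances))
out_i = residual_cross_attention(queries, instances[i_perm])
print(np.allclose(out_i, out))
```

Under these assumptions the residual term only adds each query back to its own pooled result, so it does not introduce any dependence on instance order.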


Figure 2. Overview of MIVPG. 2a: When handling multiple visual inputs, the initial step involves aggregating them at the image level. QFormer can be treated as a Multiple Instance Learning module that takes multiple samples as instances. The MIVPG complements QFormer by introducing a correlated self-attention module and the pyramid positional encoding module, depending on specific scenarios. 2b: Image-level aggregation can employ various MIL strategies, either learnable, such as AB-MIL, or fixed, for example, always selecting a specific token. 2c: The visual prompt embeddings produced by QFormer are combined with textual prompt embeddings and forwarded to the LLM for generating outputs.

Since the self-attention layer within the QFormer block also satisfies permutation equivalence, QFormer can be viewed as a multi-head MIL mechanism.

From the MIL standpoint, the weighted pooling in Equation 1 assumes that instances are independent and identically distributed (i.i.d.) [34]. In practice, however, instances may be correlated, and accounting for instance correlation can improve performance. Note that when each sample contains only one image, the input to QFormer consists of patch embeddings that already incorporate correlations through the self-attention layer in ViT. Performance can be improved further by integrating a Pyramid Positional Encoding Generator (PPEG) [34], which complements the proposed MIVPG when handling single-image inputs.
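One way to see what plain weighted pooling leaves out is to note that each instance is scored on its own: its unnormalized score depends only on that instance and the query, so the relative weight of two instances cannot react to anything else in the bag. The NumPy sketch below illustrates only this independent scoring, under assumed dot-product scores and arbitrary sizes. It does not implement the correlated self-attention or PPEG modules, and `attention_weights` is a hypothetical helper.

```python
import numpy as np

def attention_weights(query, instances):
    """Softmax weights of one query over a bag, using dot-product scores."""
    scores = instances @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(2)
query = rng.normal(size=8)
bag = rng.normal(size=(5, 8))       # 5 instances (illustrative size)

w_before = attention_weights(query, bag)

# Replace the last instance with a different one.
bag_changed = bag.copy()
bag_changed[4] = rng.normal(size=8)
w_after = attention_weights(query, bag_changed)

# The relative weight of instances 0 and 1 is unaffected by instance 4.
ratio_before = w_before[0] / w_before[1]
ratio_after = w_after[0] / w_after[1]
print(np.isclose(ratio_before, ratio_after))
```

A mechanism that models instance correlation would let the representation of one instance depend on the others before pooling, which this scoring scheme alone cannot do.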


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington;

(2) Wenyi Wu, Amazon;

(3) Qi Li, Amazon;

(4) Rob Barton, Amazon;

(5) Boxin Du, Amazon;

(6) Shioulin Sam, Amazon;

(7) Karim Bouyarmane, Amazon;

(8) Ismail Tutar, Amazon;

(9) Junzhou Huang, The University of Texas at Arlington.

:::


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
