MIVPG's CSA module remains effective when switching from FLAN-T5-XL to the OPT-2.7b LLM architecture.

Cross-Model Validation: MIVPG's Efficacy on Encoder-Decoder vs. Decoder-Only LLMs

2025/11/20 00:30

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

C. More Experiments

We implemented the proposed method on NVIDIA A100 GPUs with BFloat16. Except for the number of training epochs mentioned in the main paper, we kept all other hyperparameters the same as in BLIP2[22]. For PatchGastricADC22[36] and ABO[7], we trained the model for 40 epochs.

Figure 8. Experiment results on MSCOCO with or without freezing the visual encoder. We adopt the metrics used in [22].

C.1. Frozen Visual Models

In the original BLIP2[22], image sizes are upscaled to 364 × 364, and consequently, the ViT is unfrozen during the fine-tuning process. When training on the entire COCO training set, this approach yields slightly better performance, albeit at a higher computational cost.

In this section, we evaluate fine-tuning with the ViT kept frozen and image sizes unchanged. The results are shown in Figure 8. With limited data, such as 50K samples, models perform comparably whether or not the visual encoder (ViT) is frozen. However, as the number of training epochs increases, the performance gap varies: in some cases unfreezing the ViT improves performance, while in others the opposite holds. Since many real-world applications may not have access to massive training data, freezing the ViT can be a more efficient approach while still maintaining similar performance levels.

C.2. Case Study

In the main paper, we employ FLAN-T5-XL as the language model. Existing large language models can be broadly categorized into two types: encoder-decoder models and decoder-only models. FLAN-T5-XL falls into the former category. Decoder-only models are more computationally efficient, whereas encoder-decoder models can handle more sophisticated tasks. In this section, we assess the performance of MIVPG with a decoder-only model. Specifically, we use BLIP2[22] with OPT-2.7b[47] as the base LLM and validate performance on the PatchGastricADC22 dataset. In these experiments, we replace only the LLM and keep all other hyperparameters unchanged.
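The "replace only the LLM" setup can be illustrated with a small configuration sketch. This is not the authors' code: the `ExperimentConfig` class, its field names, and the `changed_fields` helper are hypothetical, and only the model names, the dataset name, and the 40-epoch setting are taken from the text.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    # Hypothetical container; the field names are assumptions for illustration.
    llm_name: str
    llm_family: str
    dataset: str
    epochs: int
    use_csa: bool


# Main-paper setting: FLAN-T5-XL (encoder-decoder), 40 epochs on PatchGastricADC22.
flan_cfg = ExperimentConfig(
    llm_name="FLAN-T5-XL",
    llm_family="encoder-decoder",
    dataset="PatchGastricADC22",
    epochs=40,
    use_csa=True,
)

# Swap only the language model; every other field is carried over unchanged.
opt_cfg = replace(flan_cfg, llm_name="OPT-2.7b", llm_family="decoder-only")


def changed_fields(a: ExperimentConfig, b: ExperimentConfig) -> list[str]:
    """Return the sorted names of the fields whose values differ between a and b."""
    a_dict, b_dict = asdict(a), asdict(b)
    return sorted(key for key in a_dict if a_dict[key] != b_dict[key])


print(changed_fields(flan_cfg, opt_cfg))
```

Deriving the second configuration from the first, rather than writing it out again, makes it easy to check that the two runs differ only in the language model.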

Table 4. Experiments on the PatchGastricADC22 dataset [36] with OPT-2.7b as the language model

The experiment results on PatchGastricADC22 using OPT-2.7b as the language model are presented in Table 4. Overall, the model continues to outperform the baselines shown in Table 1, emphasizing the advantages of integrating MLLMs into the WSI captioning task. Notably, the model with CSA performs better than the one without it, reaffirming the effectiveness of CSA. It’s also worth noting that OPT-2.7b does not outperform FLAN-T5-XL. This could be attributed, in part, to the insufficiency of training data. Since OPT-2.7b is relatively less sophisticated, more training data may be required to train a more powerful model.

C.3. More Visualization

This section provides additional visualization results on the ABO dataset, including both patch-level attention weights and image-level attention weights. In the patch-level attention weights, it is evident that the model excels in detecting the shapes of objects, as a significant portion of the patch-level weights is assigned to edges and contours. The image-level attention weights display maps for all twelve heads. Each row in a map represents a query, while each column represents an image. It’s important to note that different heads and queries exhibit varying attention patterns towards the images, demonstrating the diversity in how the model processes and attends to the input images.
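To make the layout of the image-level maps concrete, the sketch below builds one map per head in which each row is a query and each column is an image. It uses generic scaled dot-product attention on random vectors and is not the authors' implementation: only the twelve heads come from the text, while the number of queries, the number of images, and the head dimension are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_level_attention(queries, image_keys):
    """Compute per-head attention maps.

    queries:    [num_heads][num_queries][head_dim]
    image_keys: [num_heads][num_images][head_dim]
    returns:    [num_heads][num_queries][num_images], each row summing to 1
    """
    maps = []
    for head_queries, head_keys in zip(queries, image_keys):
        scale = math.sqrt(len(head_queries[0]))
        head_map = []
        for q in head_queries:
            # One score per image: scaled dot product of the query with the image key.
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in head_keys]
            head_map.append(softmax(scores))
        maps.append(head_map)
    return maps


def random_vectors(count, dim):
    """Generate `count` random vectors of length `dim` as stand-in embeddings."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
NUM_HEADS = 12    # from the text: maps are shown for all twelve heads
NUM_QUERIES = 4   # assumption, kept small for readability
NUM_IMAGES = 3    # assumption
HEAD_DIM = 8      # assumption

queries = [random_vectors(NUM_QUERIES, HEAD_DIM) for _ in range(NUM_HEADS)]
image_keys = [random_vectors(NUM_IMAGES, HEAD_DIM) for _ in range(NUM_HEADS)]
attention_maps = image_level_attention(queries, image_keys)

# Each row of a head's map shows how one query distributes its weight over the images.
for row in attention_maps[0]:
    print([round(w, 3) for w in row])
```

Because every head has its own queries and keys, the rows differ across heads, which mirrors the varied attention patterns described above.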

References: ['a short bar with weights at each end that is used usually in pairs for exercise.', 'gymnastic weight for dogs.', 'a six sided barbell.', 'a hexagonal shaped dumbbell', 'pesa de mano con exterior hexagonal']
Inference: a six sided barbell.

References: ['a chair with a metal square like right forming the armrests and the legs and also having lines on the backrest and the seat.', 'the chair is composed of a seat and a square backrest with two armrests and two square legs on each side', 'a one seater chair with flat metal armrests extended to form the legs.', 'a one seater chair with flat metal armrests extended to form the legs and having rows patterns on the backrest and seat pillow.']
Inference: a one seater chair with flat armrests and having four thin standing legs.

References: ['a stool with raised legs and a curved seat.', 'a rectangular top stool with four standing legs.', 'gray chair without arms or back with four legs that support it.', 'a stool with a rectangular cushion and has four long legs', 'a rectangular top stool .']
Inference: a rectangular top stool with four standing legs.

References: ['single seater tufted sofa with seat attached to long back and side arms having for wooden legs.', 'a one seater chair with hole pattern on the backrest and having low armrests.']
Inference: a one seater chair with hole pattern on the backrest and having four thin standing legs.

References: ['a one seater sofa with huge back pillow and broad seat pillow and having four thin standing legs.', 'a broad one seater sofa with four thin standing legs.', 'one seater sofa with armrests on each side and four short legs. the seat and backrest have rounded rectangular cushions', 'one seater sofas wide with medium legs']
Inference: a one seater sofa with huge back pillow and having four thin standing legs.

References: ['durable spongy waterproof small size pillow.', 'a structure, observed in certain extrusive igneous rocks, that is characterized by discontinuous pillow-shaped masses ranging in size from a few centimeters to a meter or more in greatest dimension (commonly between 30 cm and 60 cm).', 'square pillow with cushions', 'sofa,bed pillow, square in shape']
Inference: a square shaped pillow

References: ['a two seater sofa with four thin standing legs, cylindrical pillow at each armrest and square pattern on the seat.']
Inference: a two seater sofa with square pillow at each armrest and having four thin standing legs.

References: ['a cubical shaped ottoman seat.', 'a seat with a square shaped cushion', 'a cube shaped ottoman seat.']
Inference: a cube shaped ottoman seat with square shaped legs.

References: ['the sofa consists of two huge pieces with armrests. it has two big size pillows and four little legs.', 'the couch has square set and back consists of square pillow held up by four curved legs.', 'a two seater sofa with huge back pillows and four thin standing legs.', 'two seater sofa with armrests on each side. it has rounded rectangular cushions on the seats and backrest']
Inference: a two seater sofa with huge back pillows and having four thin standing legs.

References: ['a one seater sofa with broad seat pillow.', 'padded gray sofa.', 'a one seater sofa with broad seat and four small standing legs.']
Inference: a one seater sofa with huge back pillow and seat pillow and having four short standing legs.


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington;

(2) Wenyi Wu, Amazon;

(3) Qi Li, Amazon;

(4) Rob Barton, Amazon;

(5) Boxin Du, Amazon;

(6) Shioulin Sam, Amazon;

(7) Karim Bouyarmane, Amazon;

(8) Ismail Tutar, Amazon;

(9) Junzhou Huang, The University of Texas at Arlington.

:::


:::info This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
