This ablation study on AdaMix highlights the factors driving its efficiency in parameter-efficient fine-tuning. Results show that adaptation merging consistently outperforms random or fixed routing, while consistency regularization proves essential to maintaining strong performance. Module sharing is particularly effective in low-resource tasks, boosting convergence speed and lowering training loss compared to models without sharing. Experiments with adaptation module count and bottleneck dimension reveal diminishing returns, stressing the importance of balance over brute force scaling. Overall, AdaMix demonstrates how thoughtful design choices yield superior results to full model tuning.

The Role of Consistency and Sharing in Efficient Fine-Tuning

2025/10/01 21:00

Abstract and 1. Introduction

  2. Background

    2.1 Mixture-of-Experts

    2.2 Adapters

  3. Mixture-of-Adaptations

    3.1 Routing Policy

    3.2 Consistency regularization

    3.3 Adaptation module merging and 3.4 Adaptation module sharing

    3.5 Connection to Bayesian Neural Networks and Model Ensembling

  4. Experiments

    4.1 Experimental Setup

    4.2 Key Results

    4.3 Ablation Study

  5. Related Work

  6. Conclusions

  7. Limitations

  Acknowledgment and References

Appendix

A. Few-shot NLU Datasets B. Ablation Study C. Detailed Results on NLU Tasks D. Hyper-parameter

4.3 Ablation Study

We perform all ablation analyses on AdaMix with adapters for parameter-efficient fine-tuning.

Analysis of adaptation merging. In this ablation study, we do not merge adaptation modules and consider two different routing strategies at inference time: (a) random routing, where each input is routed to an arbitrary adaptation module, and (b) fixed routing, where all inputs are routed to the first adaptation module in AdaMix. From Table 7, we observe that AdaMix with adaptation merging performs better than any of the variants without the merging mechanism. Notably, all of the AdaMix variants outperform full model tuning.
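Adaptation merging can be illustrated with a minimal numpy sketch (our own illustration, not the authors' code): the weights of the individual adaptation modules are averaged into a single module that is then used at inference, so no routing decision is needed. All dimensions and weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 adaptation modules, each a (project-down, project-up)
# pair of a bottleneck adapter. Dimensions are illustrative only.
d_model, d_bottleneck, n_modules = 8, 2, 4
down = [rng.standard_normal((d_model, d_bottleneck)) for _ in range(n_modules)]
up = [rng.standard_normal((d_bottleneck, d_model)) for _ in range(n_modules)]

def merge(weights):
    # Average the corresponding weight matrices of all modules into one.
    return sum(weights) / len(weights)

def adapter(x, w_down, w_up):
    # Bottleneck adapter: project down, nonlinearity, project up, residual add.
    return x + np.maximum(x @ w_down, 0.0) @ w_up

down_merged, up_merged = merge(down), merge(up)

x = rng.standard_normal((1, d_model))
y_merged = adapter(x, down_merged, up_merged)  # single merged module at inference
y_fixed = adapter(x, down[0], up[0])           # fixed routing: always module 0
```

The merged model keeps only one adapter per layer, which is why the tunable-parameter count reported at inference stays the same as for a single adapter.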

Table 3: Results on E2E NLG Challenge with GPT-2 medium backbone. Best result on each task is in bold. We report AdaMix results with both adapters and LoRA as the underlying PEFT method. AdaMix outperforms all competing methods as well as the fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference. Results on DART and WebNLG are presented in Tables 4 and 5 in the Appendix.

Table 4: Results on DART with GPT-2 backbone encoder. Best result on each task is in bold. We report AdaMix results with both adapters and LoRA as the underlying PEFT method. AdaMix outperforms all competing methods as well as the fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference.

Moreover, Figure 5 shows that the performance of the merging mechanism is consistently better than the average performance of random routing and comparable to the best performance of random routing.

Table 5: Results on WebNLG with GPT-2 medium backbone. The results are based on all categories in the test set of WebNLG. Best result on each task is in bold. We report AdaMix results with both adapters and LoRA as the underlying PEFT method. AdaMix outperforms all competing methods as well as the fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference.


Analysis of consistency regularization. We drop consistency regularization during training for ablation and observe significant performance degradation in Table 8.
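The consistency regularizer can be sketched as a symmetric KL divergence between the predictions of two forward passes of the same batch through differently routed adaptation modules. This is our own minimal numpy illustration with hypothetical logits, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) per example, with a small epsilon for numerical safety.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def consistency_loss(logits_a, logits_b):
    # Symmetric KL between the predictions of the two routed passes.
    p, q = softmax(logits_a), softmax(logits_b)
    return float(np.mean(kl(p, q) + kl(q, p)))

# Hypothetical logits from two stochastically routed passes over one example.
logits_a = np.array([[2.0, 0.5, -1.0]])
logits_b = np.array([[1.8, 0.7, -0.9]])
loss = consistency_loss(logits_a, logits_b)
```

Penalizing this divergence pushes the randomly routed modules to agree on their predictions, which is what makes averaging their weights at inference viable.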

Analysis of adaptation module sharing. We remove adaptation module sharing in AdaMix for ablation and keep four different copies of project-down and four project-up FFN layers. From Table 8, we observe the performance gap between AdaMix and AdaMix w/o sharing to increase with decreasing dataset size, demonstrating the importance of parameter sharing for low-resource tasks (e.g.,

Table 6: Average performance and standard deviation of several parameter-efficient fine-tuning strategies based on RoBERTa-large with |K| = 30 training labels. The best performance is shown in bold. Prompt-tuning, Head-only and BitFit tune 1M model parameters during inference. Houlsby Adapter, LiST Adapter and AdaMix Adapter tune 14M model parameters. * denotes that the results are taken from (Wang et al., 2021).

Table 7: AdaMix without adaptation merging and different routing and ensembling strategies. Average results are presented on the GLUE development set with BERT-base encoder. Detailed task results in Table 14 of the Appendix for BERT-base and RoBERTa-large encoders.

Figure 5: Violin plot of AdaMix-RandomRouting performance distribution with RoBERTa-large encoders. Red dot denotes the performance of AdaMix.

Table 8: Ablation study demonstrating the impact of consistency regularization and sharing in AdaMix.

RTE, MRPC). This is further demonstrated in Figure 7 in the Appendix, which shows faster convergence and lower training loss for AdaMix with sharing compared to AdaMix without sharing, given the same number of training steps. We explore which adaptation module to share (project-up vs. project-down) in Table 11 in the Appendix, which depicts similar results.

Impact of the number of adaptation modules. In this study, we vary the number of adaptation modules in AdaMix as 2, 4 and 8 during training. Table 9 shows diminishing returns on aggregate task performance with an increasing number of modules. As we increase sparsity and the number of tunable parameters by increasing the number of adaptation modules, low-resource tasks like RTE and SST-2 – with a limited amount of labeled data for fine-tuning – degrade in performance compared to high-resource tasks like MNLI and QNLI.
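The parameter savings from sharing one of the two projection layers across modules can be sketched with a back-of-the-envelope count. The dimensions below are hypothetical, chosen only to make the arithmetic concrete:

```python
# Hypothetical dimensions: hidden size, adapter bottleneck, number of modules.
d_model, d_bottleneck, n_modules = 768, 48, 4

# Without sharing: every module keeps its own project-down and project-up matrix.
params_no_sharing = n_modules * (d_model * d_bottleneck + d_bottleneck * d_model)

# With project-down sharing: one shared project-down, n separate project-up layers.
params_sharing = d_model * d_bottleneck + n_modules * (d_bottleneck * d_model)

reduction = 1 - params_sharing / params_no_sharing  # fraction of weights saved
```

With four modules, sharing one of the two projections removes 3 of the 8 projection matrices per layer, which matters most when little labeled data is available to fit the extra weights.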

Table 9: Varying the number of adaptation modules in AdaMix with RoBERTa-large encoder. * denotes the number of modules used in AdaMix with adapters.

Impact of adapter bottleneck dimension. Table 10 shows the impact of the bottleneck dimension of adapters with different encoders in AdaMix. Model performance improves as the bottleneck dimension, and thus the number of trainable parameters, increases, with diminishing returns beyond a certain point.
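The relationship between bottleneck dimension and trainable parameters is linear, which is worth making explicit. Below is our own sketch of a Houlsby-style per-layer count with a hypothetical hidden size, not the paper's exact accounting:

```python
def adapter_params(d_model: int, r: int) -> int:
    # Two projection matrices (d_model x r and r x d_model) plus their biases.
    return 2 * d_model * r + r + d_model

# Trainable parameters per adapter layer grow linearly with the bottleneck
# dimension r, while (per the ablation) accuracy gains eventually flatten out.
counts = {r: adapter_params(768, r) for r in (8, 16, 32, 64)}
```

Since doubling r roughly doubles the tunable-parameter budget but the accuracy curve flattens, a moderate bottleneck is the better trade-off.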


:::info Authors:

(1) Yaqing Wang, Purdue University ([email protected]);

(2) Sahaj Agarwal, Microsoft ([email protected]);

(3) Subhabrata Mukherjee, Microsoft Research ([email protected]);

(4) Xiaodong Liu, Microsoft Research ([email protected]);

(5) Jing Gao, Purdue University ([email protected]);

(6) Ahmed Hassan Awadallah, Microsoft Research ([email protected]);

(7) Jianfeng Gao, Microsoft Research ([email protected]).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

