AdaMix is a new parameter-efficient fine-tuning (PEFT) framework for large pre-trained language models. Unlike single-adaptation methods, AdaMix leverages a mixture of adaptation modules with stochastic routing and weight merging, achieving state-of-the-art results on both natural language understanding and generation tasks. By tuning only 0.1-0.2% of the parameters, it outperforms full model fine-tuning and existing PEFT methods such as adapters and LoRA, albeit at a slightly higher training cost.

Beating Full Fine-Tuning with Only 0.2% of the Parameters


Abstract and 1. Introduction

  2. Background

    2.1 Mixture-of-Experts

    2.2 Adapters

  3. Mixture-of-Adaptations

    3.1 Routing Policy

    3.2 Consistency Regularization

    3.3 Adaptation Module Merging and 3.4 Adaptation Module Sharing

    3.5 Connection to Bayesian Neural Networks and Model Ensembling

  4. Experiments

    4.1 Experimental Setup

    4.2 Main Results

    4.3 Ablation Study

  5. Related Work

  6. Conclusions

  7. Limitations

  8. Acknowledgment and References

Appendix

A. Few-shot NLU Datasets  B. Ablation Study  C. Detailed Results on NLU Tasks  D. Hyper-parameters

5 Related Work

Parameter-efficient fine-tuning of pre-trained language models. Recent work on parameter-efficient fine-tuning (PEFT) falls roughly into two categories: (1) tuning a subset of the existing parameters, including head fine-tuning (Lee et al., 2019) and bias-term tuning (Zaken et al., 2021); and (2) tuning newly introduced parameters, including adapters (Houlsby et al., 2019; Pfeiffer et al., 2020), prompt tuning (Lester et al., 2021), prefix tuning (Li and Liang, 2021), and low-rank adaptation (Hu et al., 2021). In contrast to prior work that operates on a single adaptation module, AdaMix introduces a mixture of adaptation modules, using stochastic routing during training and merging the adaptation modules during inference to keep the computational cost the same as that of a single module. Furthermore, AdaMix can be applied on top of any PEFT method to further boost its performance.

Table 10: Varying the bottleneck dimension of AdaMix adapters in the RoBERTa-large encoder. * denotes the adapter bottleneck dimension used in AdaMix. Results for the BERT-base encoder are reported in Table 12 of the Appendix.
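Below is a minimal PyTorch-style sketch of this routing-and-merging idea, under simplifying assumptions: `MixtureOfAdapters`, its argument names, and the choice of one uniformly random adapter per forward pass are illustrative, not the authors' implementation, which additionally uses consistency regularization between stochastic forward passes and module sharing.

```python
import copy
import random

import torch
import torch.nn as nn


class MixtureOfAdapters(nn.Module):
    """Illustrative sketch (hypothetical names): several bottleneck adapters with
    stochastic routing during training and weight merging for inference."""

    def __init__(self, d_model: int, bottleneck: int, num_adapters: int = 4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, bottleneck),
                nn.ReLU(),
                nn.Linear(bottleneck, d_model),
            )
            for _ in range(num_adapters)
        )
        self.merged = None  # filled in by merge_adapters() before inference

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Stochastic routing: pick one adapter uniformly at random,
            # with no learned gating network to train or load-balance.
            adapter = random.choice(list(self.adapters))
        else:
            adapter = self.merged if self.merged is not None else self.adapters[0]
        # Residual connection around the adapter, as in standard adapter tuning.
        return hidden + adapter(hidden)

    @torch.no_grad()
    def merge_adapters(self) -> None:
        """Average all adapter weights into a single module so that the
        inference cost equals that of one adapter."""
        merged = copy.deepcopy(self.adapters[0])
        for target, *sources in zip(
            merged.parameters(), *(a.parameters() for a in self.adapters)
        ):
            target.copy_(torch.stack([s.detach() for s in sources]).mean(dim=0))
        self.merged = merged
```

Calling `merge_adapters()` once after training collapses the mixture into a single module, which is why serving cost stays at the level of a single adapter.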

Mixture-of-Experts (MoE). Shazeer et al. (2017) introduced MoE models with a single gating network, using Top-k routing and load balancing across experts. Fedus et al. (2021) proposed initialization and training schemes for Top-1 routing. Zuo et al. (2021) proposed consistency regularization for random routing; Yang et al. (2021) proposed k Top-1 routing with expert prototypes, while Roller et al. (2021) and Lewis et al. (2021) addressed other load-balancing issues. All of the above works study sparse MoE models whose full set of parameters is pre-trained from scratch. In contrast, we study parameter-efficient adaptation of pre-trained language models by tuning only a very small number of sparse adapter parameters.
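For readers unfamiliar with MoE routing, the short sketch below shows what a learned top-1 gate in the spirit of the switch-style routing cited above might look like; the class name and shapes are illustrative assumptions, and AdaMix's stochastic routing essentially replaces such a learned gate with a uniform random choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top1Gate(nn.Module):
    """Illustrative learned top-1 gate: score experts per token and send each
    token to its highest-scoring expert (load balancing omitted for brevity)."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, d_model)
        probs = F.softmax(self.gate(tokens), dim=-1)       # (batch, seq_len, num_experts)
        expert_idx = probs.argmax(dim=-1)                  # chosen expert per token
        expert_prob = probs.gather(-1, expert_idx.unsqueeze(-1)).squeeze(-1)
        return expert_idx, expert_prob                     # used to dispatch and scale outputs
```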

Model weight averaging. Recent explorations (Szegedy et al., 2016; Matena and Raffel, 2021; Wortsman et al., 2022; Izmailov et al., 2018) study model aggregation by averaging all model weights. Matena and Raffel (2021) propose merging pre-trained language models fine-tuned on various text classification tasks. Wortsman et al. (2022) explore averaging the weights of models from multiple independent runs on the same task with different hyper-parameter configurations. In contrast to these works on full model fine-tuning, we focus on parameter-efficient fine-tuning: we explore weight averaging for merging the weights of adaptation modules, which consist of small sets of tunable parameters updated during model tuning while the large PLM parameters are kept frozen.
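As a concrete illustration of weight averaging, the hedged helper below takes the state dicts of several trained modules and returns their element-wise mean; the function name is hypothetical, and in the AdaMix setting it would be applied only to the small adaptation modules while the frozen PLM weights are left untouched.

```python
from typing import Dict, List

import torch


def average_state_dicts(state_dicts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Element-wise mean of matching tensors across several checkpoints."""
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged


# Example: merge the weights of several adaptation modules into one.
# merged_weights = average_state_dicts([m.state_dict() for m in adapter_modules])
```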

6 Conclusions

We developed AdaMix, a new framework for parameter-efficient fine-tuning (PEFT) of large pre-trained language models (PLMs). AdaMix leverages a mixture of adaptation modules to improve downstream task performance without increasing the computational cost (e.g., FLOPs, parameters) of the underlying adaptation method. We demonstrate that AdaMix works with and improves over different PEFT methods, such as adapters and low-rank decompositions, on both NLU and NLG tasks.

By tuning only 0.1-0.2% of a PLM's parameters, AdaMix outperforms full model fine-tuning, which updates all model parameters, as well as other state-of-the-art PEFT methods.

7 Limitations

The proposed AdaMix method is somewhat compute-intensive, as it involves fine-tuning large-scale language models. Because training maintains multiple copies of the adapters, AdaMix's training cost is higher than that of standard PEFT methods. In our empirical observations, AdaMix typically requires 1-2x the number of training iterations of standard PEFT training, which negatively impacts the carbon footprint of training the described models.

AdaMix is orthogonal to most existing parameter-efficient fine-tuning (PEFT) research and can potentially improve the performance of any PEFT method. In this work we explored two representative PEFT methods, adapters and LoRA, but we did not experiment with other combinations such as prompt tuning and prefix tuning. We leave these studies to future work.

8 Acknowledgment

The authors thank the anonymous reviewers for their valuable comments and helpful suggestions, and thank Guoqing Zheng and Ruya Kang for their insightful discussions on the project. This work is supported in part by the US National Science Foundation under grants NSF-IIS 1747614 and NSF-IIS-2141037. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319– 7328, Online. Association for Computational Linguistics.

Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty Visual Object Classification, and Recognizing Textual Entailment.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186.

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. 2020. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR.

Yarin Gal and Zoubin Ghahramani. 2015. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. CoRR, abs/1506.02142.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1183–1192. PMLR.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Association for Computational Linguistics (ACL).

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The webnlg challenge: Generating text from rdf data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.

Jaejun Lee, Raphael Tang, and Jimmy Lin. 2019. What would elsa do? freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691.

Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. 2021. Base layers: Simplifying training of large, sparse models. In ICML.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. CoRR, abs/2101.00190.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Wen-tau Yih, and Madian Khabsa. 2021. Unipelt: A unified framework for parameter-efficient language model tuning. arXiv preprint arXiv:2110.07577.

Michael Matena and Colin Raffel. 2021. Merging models with fisher-weighted averaging. arXiv preprint arXiv:2111.09832.

Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang
