We addressed the critical challenge of noisy labels in prompt learning for vision-language foundation models by introducing PromptMAE and PromptOT.
The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text features in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
In the realm of noisy-label learning, mean absolute error (MAE) has long been recognized as a robust loss function within the traditional training paradigm. However, MAE often suffers from slow convergence and poor performance during training, so it is seldom employed as a classification loss in noisy-label learning.
Nevertheless, our investigation reveals an interesting phenomenon: employing MAE loss in prompt learning (PromptMAE) notably enhances robustness over the traditional cross-entropy loss while maintaining high accuracy. As shown in the figure, MAE exhibits strong accuracy and fast convergence even in the presence of substantial label noise.
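To make the robustness intuition concrete, here is a minimal numpy sketch (not the paper's implementation) contrasting cross-entropy with the MAE classification loss. For a one-hot label, the MAE between the label and the softmax output equals 2(1 - p_y), so it is bounded by 2 no matter how confidently the model contradicts a (possibly mislabeled) sample, whereas cross-entropy can grow without bound and let noisy samples dominate training.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ce_loss(logits, y):
    # Standard cross-entropy: -log p_y, unbounded for confident mistakes
    p = softmax(logits)
    return -np.log(p[np.arange(len(y)), y])

def mae_loss(logits, y):
    # MAE between one-hot label and prediction:
    # ||e_y - p||_1 = 2 * (1 - p_y), always bounded in [0, 2]
    p = softmax(logits)
    return 2.0 * (1.0 - p[np.arange(len(y)), y])

# A sample whose label contradicts a confident prediction (typical of
# label noise) vs. an uncertain sample
logits = np.array([[8.0, 0.0, 0.0],   # model is sure of class 0
                   [1.0, 1.2, 0.8]])  # model is uncertain
labels = np.array([1, 1])             # first label disagrees with the model

print(ce_loss(logits, labels))        # CE on the contradicted sample is large (~8)
print(mae_loss(logits, labels))       # MAE stays below 2
```

The bounded loss caps the gradient contribution of confidently contradicted (likely noisy) samples, which is the mechanism PromptMAE exploits.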
To elucidate PromptMAE's robustness, we leverage feature learning theory to demonstrate that it can suppress the influence of noisy samples, thereby enhancing the robustness of prompt learning in vision–language models.
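The PromptOT purification step described above can be sketched with a plain Sinkhorn iteration. This is a simplified illustration, not the paper's code: the image features and text prototypes below are random stand-ins (in NLPrompt they would come from CLIP's encoders), and uniform marginals are assumed. Samples whose OT pseudo-label agrees with the given label form the clean subset (trained with CE loss); the rest form the noisy subset (trained with MAE loss).

```python
import numpy as np

def sinkhorn(cost, r, c, eps=0.5, iters=500):
    """Entropic-regularized OT: returns a transport plan Q whose
    row sums match r and column sums match c (probability vectors)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(r)
    for _ in range(iters):
        v = c / (K.T @ u)   # scale columns toward marginal c
        u = r / (K @ v)     # scale rows toward marginal r
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n, k, d = 12, 3, 8  # hypothetical sizes: samples, classes, feature dim

# Stand-ins for L2-normalized CLIP image features and class-text prototypes
img = rng.normal(size=(n, d)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(k, d)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)

cost = 1.0 - img @ txt.T            # cosine distance as the OT cost
r = np.full(n, 1.0 / n)             # uniform mass over samples
c = np.full(k, 1.0 / k)            # uniform mass over classes
Q = sinkhorn(cost, r, c)

ot_label = Q.argmax(axis=1)         # OT's pseudo-label for each sample
given = rng.integers(0, k, size=n)  # the dataset's (possibly noisy) labels
clean = ot_label == given           # agreement  -> clean subset, CE loss
noisy = ~clean                      # disagreement -> noisy subset, MAE loss
```

Using the text features as prototypes (rather than random initialization) is what lets the transport plan inherit CLIP's image-text alignment; the ablation below probes exactly this choice.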
For the synthetic noisy dataset, the accuracy of the image classification task under different noise levels is shown in the table below, demonstrating the effectiveness and superiority of NLPrompt in addressing the noisy label problem in prompt learning.
The results on the real-world noisy dataset Food101N are shown in the table below, where NLPrompt outperforms all the baseline methods.
Our method NLPrompt is effective not only for CoOp but also for other prompt learning methods that build upon it, such as VPT, MaPLe, and PromptSRC. NLPrompt significantly improves the robustness of these diverse prompt learning methods against noisy labels, verifying its strong generalization ability.
To evaluate the effectiveness of each component of our method, we conduct ablation studies on the Flowers102 dataset.
The experimental design is as follows:
(a) Use CE loss for all data;
(b) Use MAE loss for all data;
(c) Use randomly initialized prototypes instead of CLIP text features as initialization;
(d) Use CE loss only on the clean subset, discarding the noisy data;
(e) Use MAE loss only on the noisy subset, discarding the clean data.
The average results show that (b) outperforms (a), validating the effectiveness of our PromptMAE.
Moreover, the average results show that (d) outperforms (a), and (e) outperforms (b), further validating the effectiveness of PromptOT in the data purification process. Additionally, the comparison between (c) and NLPrompt highlights the importance of text feature initialization in our method.
Among all methods, our NLPrompt achieves the best performance, with significant improvements over other baselines, further validating the effectiveness of each component.
@inproceedings{pan2025nlprompt,
title={NLPrompt: Noise-Label Prompt Learning for Vision-Language Models},
author={Pan, Bikang and Li, Qun and Tang, Xiaoying and Huang, Wei and Fang, Zhen and Liu, Feng and Wang, Jingya and Yu, Jingyi and Shi, Ye},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={19963--19973},
year={2025}
}