NLPrompt : Noise-Label Prompt Learning
for Vision-Language Models

🏆 CVPR 2025 Highlight

1ShanghaiTech University, Shanghai, China
2The Chinese University of Hong Kong, Shenzhen, China
3RIKEN Center for Advanced Intelligence Project, Japan
4University of Technology Sydney, Australia   5University of Melbourne, Australia
Indicates Equal Contribution   *Indicates Corresponding Author


We address the critical challenge of noisy labels in prompt learning for vision-language foundation models by introducing PromptMAE and PromptOT.

Abstract

The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text features in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
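As a rough illustration of the PromptOT idea described above (not the paper's implementation; the entropic solver, uniform marginals, regularization strength, and the argmax agreement rule are all our assumptions), the sketch below uses CLIP text features as class prototypes, solves a small optimal transport problem between image features and prototypes, and marks a sample "clean" when the OT pseudo-label agrees with its given label:

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=100):
    """Entropic OT plan with uniform marginals (a sketch, not the paper's solver)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    r = np.full(n, 1.0 / n)   # each sample carries equal mass
    c = np.full(m, 1.0 / m)   # balanced classes assumed for simplicity
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

def purify(image_feats, text_protos, labels):
    """Split a dataset into clean/noisy using text prototypes (hypothetical helper).

    image_feats: (n, d) L2-normalized image features
    text_protos: (k, d) L2-normalized text features, one per class
    labels:      (n,) possibly noisy integer labels
    """
    cost = 1.0 - image_feats @ text_protos.T   # negative cosine similarity
    plan = sinkhorn(cost)
    pseudo = plan.argmax(axis=1)               # OT pseudo-label per sample
    return pseudo == labels                    # agree -> "clean", else "noisy"
```

Under NLPrompt, the clean subset would then be trained with cross-entropy and the noisy subset with MAE, as the abstract describes.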

Video (coming soon)

Motivation

In noisy-label learning, mean absolute error (MAE) is known to be a robust loss function under the traditional training paradigm. However, MAE often suffers from slow convergence and poor performance during training, so it is seldom employed as a classification loss in noisy-label learning.
Nevertheless, our investigation reveals an interesting phenomenon: employing the MAE loss in prompt learning (PromptMAE) notably enhances robustness while maintaining high accuracy compared to the traditional cross-entropy loss. As shown in the figure, MAE exhibits strong accuracy and fast convergence even in the presence of substantial noise.
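To make the comparison concrete, here is a minimal numpy sketch (our own illustration, not the paper's code) of the two losses applied to the softmax output of a prompt-based classifier. The key difference is that the MAE loss is bounded by 2, so no single mislabeled sample can dominate the gradient:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_loss(probs, labels):
    # cross-entropy -log p_y: unbounded as p_y -> 0, so mislabeled
    # samples can receive arbitrarily large loss and gradient
    return -np.log(probs[np.arange(len(labels)), labels])

def mae_loss(probs, labels):
    # MAE to the one-hot label: ||p - e_y||_1 = 2 * (1 - p_y),
    # bounded in [0, 2] regardless of how wrong the label is
    return 2.0 * (1.0 - probs[np.arange(len(labels)), labels])
```

A sample the model confidently assigns to another class (likely mislabeled) incurs an arbitrarily large cross-entropy loss but at most 2 under MAE, which is one intuition for the suppression of noisy samples analyzed in the paper.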


To elucidate PromptMAE's robustness, we leverage feature learning theory to demonstrate that it can suppress the influence of noisy samples, thereby enhancing the robustness of prompt learning in vision–language models.

Experimental Results

On the synthetic noisy datasets


For the synthetic noisy datasets, image classification accuracy under different noise levels is reported in the table below, demonstrating the effectiveness and superiority of NLPrompt in addressing the noisy-label problem in prompt learning.

On the real-world noisy dataset


The results on the real-world noisy dataset Food101N are shown in the table below, where NLPrompt outperforms all the baseline methods.

Generalization of NLPrompt


NLPrompt is effective not only for CoOp but also for subsequent prompt learning methods that build on it, such as VPT, MaPLe, and PromptSRC. It significantly improves the robustness of these methods in the face of noisy labels, verifying the strong generalization ability of NLPrompt.

Ablation Studies

To evaluate the effectiveness of each component of our method, we conduct ablation studies on the Flowers102 dataset.


The experimental design is as follows:
(a) Use CE loss for all data;
(b) Use MAE loss for all data;
(c) Use randomly initialized prototypes instead of CLIP text features as the initialization;
(d) Use CE loss for clean data only after removing noisy data;
(e) Use MAE loss for noisy data only after removing clean data.
The average results show that (b) outperforms (a), validating the effectiveness of PromptMAE. Likewise, (d) outperforms (a) and (e) outperforms (b), validating the effectiveness of PromptOT's data purification. Additionally, the comparison between (c) and NLPrompt highlights the importance of initializing the prototypes with CLIP text features.
Among all methods, NLPrompt achieves the best performance, with significant improvements over the baselines, further validating the effectiveness of each component.
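Given a clean/noisy split, the full objective combining settings (d) and (e) can be sketched as follows (a minimal illustration with our own naming, assuming per-sample softmax probabilities and a boolean clean mask; not the paper's code):

```python
import numpy as np

def nlprompt_loss(probs, labels, clean_mask):
    """Cross-entropy on the clean split, MAE on the noisy split.

    probs:      (n, k) softmax outputs of the prompt-based classifier
    labels:     (n,) possibly noisy integer labels
    clean_mask: (n,) boolean, True where the sample was judged clean
    """
    p_y = probs[np.arange(len(labels)), labels]
    ce = -np.log(p_y)          # cross-entropy per sample (clean split)
    mae = 2.0 * (1.0 - p_y)    # ||p - one_hot||_1 per sample (noisy split)
    return np.where(clean_mask, ce, mae).mean()
```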

Citation

@inproceedings{pan2025nlprompt,
    title={NLPrompt: Noise-Label Prompt Learning for Vision-Language Models},
    author={Pan, Bikang and Li, Qun and Tang, Xiaoying and Huang, Wei and Fang, Zhen and Liu, Feng and Wang, Jingya and Yu, Jingyi and Shi, Ye},
    booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
    pages={19963--19973},
    year={2025}
}