NLPrompt: Noise-Label Prompt Learning
for Vision-Language Models

🏆 CVPR 2025 Highlight

1ShanghaiTech University, Shanghai, China
2The Chinese University of Hong Kong, Shenzhen, China
3RIKEN Center for Advanced Intelligence Project, Japan
4University of Technology Sydney, Australia   5University of Melbourne, Australia

NLPrompt Framework Overview

We introduce NLPrompt, a simple yet effective framework to tackle the critical challenge of noisy labels in prompt learning for vision-language models. Our approach combines a robust loss function (PromptMAE) with an Optimal Transport-based data purification method (PromptOT) to achieve state-of-the-art performance.

Abstract

The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text features in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.

Motivation

In noisy label learning, Mean Absolute Error (MAE) is known for its robustness but is often overlooked due to slow convergence and poor performance in traditional training. However, we discovered a surprising phenomenon: when applied to prompt learning (which we term PromptMAE), MAE not only enhances robustness against label noise but also maintains high accuracy and fast convergence, significantly outperforming the standard Cross-Entropy (CE) loss, especially under high noise conditions.
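The robustness gap described above comes down to how the two losses treat a confidently misclassified (i.e., likely mislabeled) sample. A minimal NumPy sketch (the logit values and helper names are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ce_loss(s, y):
    # Cross-entropy: -y^T log(s). Unbounded when the labeled class
    # receives low probability, so mislabeled samples dominate training.
    return -(y * np.log(s)).sum(axis=-1)

def mae_loss(s, y):
    # Mean absolute error: ||y - s||_1. Bounded in [0, 2], so a single
    # mislabeled sample can only contribute a constant-sized loss.
    return np.abs(y - s).sum(axis=-1)

# A confident prediction that disagrees with a (noisy) label.
s = softmax(np.array([[6.0, 0.0, 0.0]]))   # model is sure of class 0
y = np.array([[0.0, 1.0, 0.0]])            # noisy label says class 1
print(ce_loss(s, y))    # large (~6): CE lets the noisy sample dominate
print(mae_loss(s, y))   # bounded (< 2): MAE suppresses its influence
```

The boundedness of MAE is exactly the property the feature-learning analysis below formalizes: noisy samples cannot produce arbitrarily large loss (or gradient) contributions.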

Performance of MAE vs. CE loss under different noise rates

This motivated us to explore why MAE works so well in this context. Through theoretical analysis using feature learning theory, we demonstrate that PromptMAE can effectively suppress the influence of noisy samples, thereby boosting the model's overall robustness.

Method

Our method, NLPrompt, harmonizes the strengths of both MAE and CE loss through a novel data purification strategy called PromptOT. The process involves two main steps:

  1. Data Purification with PromptOT: We leverage the strong alignment between image features and text features in pre-trained models like CLIP. Instead of using random prototypes, we use the text features as semantic prototypes for Optimal Transport. With text features $\mathbf{T} \in \mathbb{R}^{C \times d}$ and image features $\mathbf{I} \in \mathbb{R}^{N \times d}$, the negative log-similarity $-\log(\mathbf{T}\mathbf{I}^\top)$ serves as the cost matrix, and solving for the optimal transport matrix partitions the dataset into clean and noisy subsets. The OT problem is formulated as: $$ \min_{\mathbf{Q} \in \mathbb{R}^{C \times N}} \langle -\log(\mathbf{T}\mathbf{I}^\top), \mathbf{Q} \rangle \quad \text{s.t.} \quad \mathbf{Q}\mathbf{1}_N = \frac{1}{C}\mathbf{1}_C, \quad \mathbf{Q}^\top\mathbf{1}_C = \frac{1}{N}\mathbf{1}_N $$
  2. Harmonized Loss Function: Recognizing that CE loss excels on clean data while MAE is more robust to noise, we apply a dual-loss strategy: the clean subset is trained with CE loss, while the noisy subset is trained with MAE loss to ensure robustness. With $\mathbf{s}_i$ the model's softmax prediction and $\mathbf{y}_i$ the (possibly noisy) one-hot label, the harmonized loss is: $$ \mathcal{L}_{\text{NLPrompt}} = \sum_{i \in \mathcal{D}_{\text{clean}}} - \mathbf{y}_i^\top \log(\mathbf{s}_i) + \sum_{j \in \mathcal{D}_{\text{noisy}}} ||\mathbf{y}_j - \mathbf{s}_j||_1. $$
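The two steps above can be sketched in NumPy. This is an illustrative toy, not the paper's implementation: the entropic (Sinkhorn-style) solver, the `eps` regularization, the synthetic features, and the column-softmax used to keep the logarithm well defined are all assumptions made for the sketch.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    # Entropic OT with the uniform marginals from the formulation:
    # Q 1_N = (1/C) 1_C and Q^T 1_C = (1/N) 1_N.
    C, N = cost.shape
    K = np.exp(-cost / eps)
    v = np.ones(N)
    for _ in range(n_iters):
        u = (np.ones(C) / C) / (K @ v)
        v = (np.ones(N) / N) / (K.T @ u)
    return u[:, None] * K * v[None, :]        # transport matrix Q (C x N)

rng = np.random.default_rng(0)
C, N, d = 3, 12, 16
# T: text-feature prototypes, I: image features (L2-normalized, as in CLIP).
T = rng.normal(size=(C, d)); T /= np.linalg.norm(T, axis=1, keepdims=True)
true = rng.integers(0, C, size=N)
I = T[true] + 0.1 * rng.normal(size=(N, d))
I /= np.linalg.norm(I, axis=1, keepdims=True)

labels = true.copy()
labels[0] = (labels[0] + 1) % C               # flip one label to simulate noise

sim = T @ I.T                                  # (C, N) similarity matrix
p = np.exp(sim) / np.exp(sim).sum(0)           # column-softmax keeps log defined
Q = sinkhorn(-np.log(p))                       # solve the OT problem
pseudo = Q.argmax(axis=0)                      # OT pseudo-label per sample
clean = pseudo == labels                       # agree -> clean subset (CE loss),
                                               # disagree -> noisy subset (MAE loss)
```

Samples whose OT pseudo-label agrees with the given label form the clean subset trained with CE; the rest form the noisy subset trained with MAE, yielding the harmonized loss above.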

This simple and efficient approach fully utilizes the expressive representation and precise alignment capabilities of vision-language models for robust prompt learning.

Experimental Results

Performance on Synthetic Noisy Datasets

Results on synthetic noisy datasets

On seven benchmark datasets with varying levels of symmetric and asymmetric noise, NLPrompt consistently achieves state-of-the-art performance, demonstrating significant improvements over existing methods, especially in high-noise scenarios.

Performance on Real-World Noisy Dataset

Results on Food101N

On the real-world noisy dataset Food101N, NLPrompt outperforms all baseline methods, further validating its effectiveness and practical applicability.

Generalization of NLPrompt

Generalization results on EuroSAT

NLPrompt is not limited to CoOp. It can be seamlessly integrated with other advanced prompt tuning methods like VPT, MaPLe, and PromptSRC, consistently boosting their robustness against noisy labels and showcasing its strong generalization capability.

Ablation Studies

We conducted ablation studies on the Flowers102 dataset to validate the effectiveness of each component in our framework.

Ablation study results

The key findings are:

  • Using MAE loss for all data (b) outperforms using CE loss for all data (a), confirming the effectiveness of PromptMAE.
  • Separately training on the purified clean subset (d) and noisy subset (e) yields better results than training on the whole dataset, validating the efficacy of our PromptOT purification process.
  • Our full NLPrompt method significantly outperforms a variant using random initialization (c), highlighting the importance of using semantic text features as OT prototypes.

Overall, the results confirm that every component of NLPrompt contributes to its superior performance.

Citation

@inproceedings{pan2025nlprompt,
    title={NLPrompt: Noise-Label Prompt Learning for Vision-Language Models},
    author={Pan, Bikang and Li, Qun and Tang, Xiaoying and Huang, Wei and Fang, Zhen and Liu, Feng and Wang, Jingya and Yu, Jingyi and Shi, Ye},
    booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
    pages={19963--19973},
    year={2025}
}