Vision Transformer with Contrastive Learning for Remote Sensing Image Scene Classification
B. Naveen, Dr. N. Srihari Rao
PG Scholar, Department of CSE, Guru Nanak Institutions Technical Campus, Hyderabad.
Professor, Department of CSE, Guru Nanak Institutions Technical Campus, Hyderabad.
Abstract: Remote sensing images (RSIs) are characterized by diverse ground-object formations and intricate spatial layouts. Because the Vision Transformer (ViT) can capture long-range interactions across patches of an input image, it is a promising option for scene classification. However, ViT generalizes poorly when trained on insufficient data because it lacks several of the inductive biases inherent to CNNs, such as locality and translation equivariance. Transferring a large-scale pretrained ViT is more cost-effective and performs better, even on small-scale target data, than training one from scratch. Although the cross-entropy (CE) loss is widely used in scene classification, it generalizes poorly across scenes and is not robust to noisy labels. The proposed ViT-CL model combines supervised contrastive learning (CL) with a ViT-based backbone. Obtained by extending the self-supervised contrastive approach to the fully supervised setting, the supervised contrastive (SupCon) loss exploits the label information of RSIs in embedding space and enhances robustness to common image corruptions. ViT-CL employs a joint loss function that combines the SupCon and CE losses to encourage the model to learn more discriminative features. Additionally, a two-stage optimization framework is presented to improve the controllability of the ViT-CL model's optimization procedure. Comprehensive experiments on the AID, NWPU-RESISC45, and UCM datasets confirmed the superior performance of ViT-CL, which achieved the highest accuracies of 97.42%, 94.54%, and 99.76%, respectively, among all competing approaches.
Keywords: Vision Transformer, Contrastive Learning, Remote Sensing, Scene Classification, Self-Supervised Learning.
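To make the joint objective summarized in the abstract concrete, the following is a minimal PyTorch sketch of a SupCon term combined with cross-entropy. It follows the standard SupCon formulation rather than the authors' released code; the function names (supcon_loss, joint_loss), the temperature value, and the weighting factor lam are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive (SupCon) loss over a batch of embeddings.

    features: (N, D) embedding batch; labels: (N,) integer class labels.
    Positives for anchor i are all other samples sharing its label.
    Illustrative sketch; temperature=0.1 is an assumed default.
    """
    device = features.device
    z = F.normalize(features, dim=1)              # project onto unit sphere
    sim = torch.matmul(z, z.T) / temperature      # pairwise scaled similarities
    # Exclude each sample's similarity with itself from numerator and denominator.
    not_self = ~torch.eye(len(z), dtype=torch.bool, device=device)
    # Positive pairs: same label, excluding self.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    sim = sim.masked_fill(~not_self, float('-inf'))
    # Log-probability of each pair under a softmax over all other samples.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average the log-probability over positives, for anchors with >= 1 positive.
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    return -(pos_log_prob[valid] / pos_counts[valid]).mean()

def joint_loss(logits, features, labels, lam=0.5):
    """Joint objective: cross-entropy plus a weighted SupCon term.

    lam is a hypothetical weighting factor; the paper's exact value may differ.
    """
    return F.cross_entropy(logits, labels) + lam * supcon_loss(features, labels)

# Example usage with random data (batch of 8 embeddings, 4 classes):
# z = torch.randn(8, 128); y = torch.randint(0, 4, (8,))
# logits = torch.randn(8, 4)
# loss = joint_loss(logits, z, y)
```

In a two-stage setup such as the one the abstract describes, one might first optimize the contrastive term to shape the embedding space and then fine-tune with the joint objective; the staging details here are an assumption, not the paper's prescribed schedule.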