Abstract
Learning meaningful representations without large amounts of labeled data has become a cornerstone challenge in machine learning, especially in scenarios involving multimodal data and sparse annotation. This paper explores a hybrid approach that combines contrastive learning with generative self-supervised techniques for robust feature extraction in cross-modal settings under low-label regimes. Our proposed framework jointly optimizes cross-modal representation alignment and sample diversity, coupling contrastive objectives with latent reconstruction. Empirical evaluation on image-text and audio-visual datasets shows improved performance on downstream classification and transfer learning tasks. The findings support the potential of integrated self-supervision for scalable, data-efficient representation learning.
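A minimal sketch of the kind of joint objective the abstract describes: a symmetric InfoNCE-style contrastive term for cross-modal alignment combined with a latent-reconstruction term. This is not the authors' code; the class name `JointSelfSupervisedLoss`, the weighting `lambda_rec`, the temperature, and the use of an MSE reconstruction loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSelfSupervisedLoss(nn.Module):
    """Contrastive alignment loss plus a weighted reconstruction loss (illustrative)."""
    def __init__(self, temperature: float = 0.07, lambda_rec: float = 1.0):
        super().__init__()
        self.temperature = temperature
        self.lambda_rec = lambda_rec

    def forward(self, z_a, z_b, recon, target):
        # z_a, z_b: (N, D) projected embeddings of paired modalities (e.g. image/text).
        # recon, target: decoder output and its reconstruction target.
        z_a = F.normalize(z_a, dim=-1)
        z_b = F.normalize(z_b, dim=-1)
        # Paired samples are positives; the rest of the batch act as negatives.
        logits = z_a @ z_b.t() / self.temperature          # (N, N) similarity matrix
        labels = torch.arange(z_a.size(0), device=z_a.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))
        # Generative term: reconstruct the target from the shared latent code.
        reconstruction = F.mse_loss(recon, target)
        return contrastive + self.lambda_rec * reconstruction

# Toy usage:
# loss_fn = JointSelfSupervisedLoss()
# z_img, z_txt = torch.randn(8, 128), torch.randn(8, 128)
# recon, x = torch.randn(8, 256), torch.randn(8, 256)
# loss = loss_fn(z_img, z_txt, recon, x)
```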