SSL = create a supervised task automatically from raw data. We basically hide part of the input and ask the model to predict it, like masking words in BERT — the labels come for free.
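A tiny sketch of that idea, assuming PyTorch and a toy setup where token id 0 is reserved as the [MASK] token: hide ~15% of the tokens and compute the loss only on the hidden positions.

```python
import torch
import torch.nn as nn

vocab_size, mask_id = 1000, 0                     # assume id 0 is reserved for [MASK]
tokens = torch.randint(1, vocab_size, (8, 32))    # raw, unlabeled token sequences

mask = torch.rand(tokens.shape) < 0.15            # hide ~15% of the tokens
inputs = tokens.masked_fill(mask, mask_id)        # corrupted input the model sees

model = nn.Sequential(                            # stand-in encoder, not real BERT
    nn.Embedding(vocab_size, 64),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(64, 4, batch_first=True), 2),
    nn.Linear(64, vocab_size),
)

logits = model(inputs)                            # (batch, seq, vocab)
# Supervision came for free: the "labels" are just the original hidden tokens.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```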
SSL in Vision
Early pretext tasks: relative patch position, colorization, rotation prediction… Masking is awkward with CNNs because you can't mask pixels cleanly: convolutions mix information across local neighborhoods, so masked regions leak context. Vision Transformers (ViT) changed this by treating image patches as tokens, so masking a patch is just dropping a token — enabling MAE. MAE → Masked Autoencoders (BERT for images)
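A loosely MAE-flavoured sketch, with assumed 16×16 patches, linear stand-ins for the ViT encoder/decoder, and a pooled code instead of the real per-patch mask tokens: patchify, keep ~25% of patches visible, and reconstruct the pixels of the dropped ones.

```python
import torch
import torch.nn as nn

B, P, D = 4, 196, 16 * 16 * 3                      # batch, patches per image, patch dim
img = torch.randn(B, 3, 224, 224)                  # unlabeled images

# 1. Patchify: each 16x16 patch becomes one "token", like a word in BERT.
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, P, D)

# 2. Randomly split patch indices into visible (~25%) and masked (~75%).
perm = torch.rand(B, P).argsort(dim=1)
vis_idx, mask_idx = perm[:, :49], perm[:, 49:]
gather = lambda x, idx: torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

# 3. Encode only the visible patches; decode a prediction for every masked one.
#    (Real MAE uses a ViT encoder and a decoder with per-patch mask tokens;
#    here everything is collapsed to a single pooled code for brevity.)
encoder = nn.Linear(D, 128)
decoder = nn.Linear(128, D)
latent = encoder(gather(patches, vis_idx)).mean(1, keepdim=True)
pred = decoder(latent.expand(-1, P - 49, -1))

# 4. The loss is pixel reconstruction of the *masked* patches only.
loss = nn.functional.mse_loss(pred, gather(patches, mask_idx))
```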
How to evaluate?
- Linear probe: freeze the pretrained encoder, train only a linear classifier on its features
- Fine-tuning: unfreeze everything and train end to end on the labeled task (both sketched below)
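A rough sketch of both protocols, assuming a torchvision ResNet-18 as a stand-in for an SSL-pretrained encoder and a 10-class downstream task.

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet18()          # pretend these are SSL weights
backbone.fc = nn.Identity()                       # expose the 512-d features
head = nn.Linear(512, 10)                         # downstream classifier head
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))

# 1) Linear probe: freeze the encoder, train only the linear head.
for p in backbone.parameters():
    p.requires_grad = False
probe_opt = torch.optim.SGD(head.parameters(), lr=0.1)
loss = nn.functional.cross_entropy(head(backbone(x)), y)
probe_opt.zero_grad(); loss.backward(); probe_opt.step()

# 2) Fine-tuning: unfreeze everything and train end to end
#    (a smaller LR for the pretrained encoder is a common choice).
for p in backbone.parameters():
    p.requires_grad = True
ft_opt = torch.optim.SGD([{"params": backbone.parameters(), "lr": 1e-3},
                          {"params": head.parameters(), "lr": 1e-2}], momentum=0.9)
loss = nn.functional.cross_entropy(head(backbone(x)), y)
ft_opt.zero_grad(); loss.backward(); ft_opt.step()
```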
Semi-supervised: labeled + unlabeled jointly → one model, two heads
Unsupervised pre-training: first train SSL on a large unlabeled set → then fine-tune on the labeled set
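A loose sketch of the "one model, two heads" idea, assuming rotation prediction as the SSL task on the unlabeled batch: a shared encoder feeds a supervised head and a self-supervised head, and the two losses are summed.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
cls_head = nn.Linear(256, 10)     # supervised head (assumed 10 classes)
rot_head = nn.Linear(256, 4)      # SSL head: predict 0/90/180/270 degree rotation

x_lab, y_lab = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
x_unlab = torch.randn(64, 3, 32, 32)

k = torch.randint(0, 4, (64,))                        # random rotation per image
x_rot = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                     for img, r in zip(x_unlab, k)])

loss_sup = nn.functional.cross_entropy(cls_head(encoder(x_lab)), y_lab)
loss_ssl = nn.functional.cross_entropy(rot_head(encoder(x_rot)), k)
loss = loss_sup + loss_ssl        # one optimizer step trains both heads jointly
```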
Generative Models
GANs
A way to build a generative model by having two NNs compete with each other: a generator turns noise into fake samples, and a discriminator tries to tell real data from fakes.
GANs produce very sharp, high-res samples, but they’re hard to train (unstable, prone to mode collapse) and there’s no explicit likelihood function.
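A minimal GAN training step with toy MLPs and BCE losses; the sizes and learning rates are placeholders, not any specific published GAN.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784) * 2 - 1                 # stand-in for a real data batch
z = torch.randn(32, 64)                            # noise input for the generator

# Discriminator step: push real samples towards 1, fakes towards 0.
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator into calling fakes real.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```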
Stable Diffusion
Start with a real image and progressively add noise, then train a model to work backwards, producing a less noisy image from a noisier one. At sampling time you start from pure noise and denoise step by step to generate a new image.
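A hedged DDPM-style sketch of the training objective: noise a clean image at a random timestep in closed form and train a small net to predict the added noise. The schedule and the tiny conv net are illustrative stand-ins, and the timestep conditioning of the real model is omitted for brevity.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal retention

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 3, 3, padding=1))   # stand-in noise predictor

x0 = torch.randn(8, 3, 32, 32)                     # clean training images
t = torch.randint(0, T, (8,))                      # one random timestep per image
noise = torch.randn_like(x0)

# Forward (noising) process in closed form:
#   x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise
a_bar = alphas_bar[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

# The reverse direction is learned: predict the noise that was added at step t
# (a real model would also condition on t).
loss = nn.functional.mse_loss(model(x_t), noise)
```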
