// Vision Transformer · Image Classifier · google/vit-base-patch16-224
Free key from huggingface.co/settings/tokens — read-only scope is sufficient.
Drag & drop · paste · or choose a file
Vision Transformers (ViT) divide an image into a fixed grid of patches (16×16 pixels each), flatten them into vectors, and feed them to a standard Transformer encoder — just like words in a sentence. The model uses global self-attention from the very first layer, allowing any patch to directly attend to every other patch.
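The patch step described above can be sketched in a few lines of NumPy. This is an illustrative reshape only (the real model also applies a learned linear projection and adds position embeddings, which are omitted here): a 224×224 RGB image yields 14×14 = 196 patches, each flattened to 16·16·3 = 768 values.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches,
    each flattened into a single vector, ViT-style."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C)
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector: (num_patches, p*p*C)
    return grid.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img)
print(tokens.shape)  # (196, 768) — the "sentence" of patch tokens fed to the encoder
```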
This classifier uses google/vit-base-patch16-224 — pretrained on ImageNet-21k and fine-tuned on ImageNet-1k — via the Hugging Face Inference API. For each image, the API returns the top-5 predicted classes (with probability scores) out of the 1,000 ImageNet categories.
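A minimal stdlib-only sketch of the request this page makes, assuming the public Inference API endpoint and a token read from an `HF_TOKEN` environment variable (the variable name and the `top_labels` helper are illustrative, not part of the demo):

```python
import json
import os
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/google/vit-base-patch16-224"

def classify(image_path: str, token: str) -> list:
    """POST raw image bytes; the API responds with a JSON list of
    {"label": ..., "score": ...} entries for the top predicted classes."""
    with open(image_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        API_URL, data=body,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def top_labels(predictions: list, k: int = 5) -> list:
    """Extract the k highest-scoring label names from an API response."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["label"] for p in ranked[:k]]

if __name__ == "__main__":
    preds = classify("cat.jpg", os.environ["HF_TOKEN"])
    print(top_labels(preds))
```

A read-only token is enough here, since inference requests never write to the Hub.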