// Vision Transformer · Image Classifier · google/vit-base-patch16-224
Free key from huggingface.co/settings/tokens — read-only scope is sufficient.
Drag & drop · paste · or choose a file
Vision Transformers (ViT) divide an image into a fixed grid of patches (16×16 pixels each), flatten them into vectors, and feed them to a standard Transformer encoder — just like words in a sentence. The model uses global self-attention from the very first layer, allowing any patch to directly attend to every other patch.
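The patch step described above can be sketched in a few lines of NumPy. This is an illustrative reshape only (the real model also applies a learned linear projection and adds position embeddings, which are omitted here): a 224×224 RGB image yields 14×14 = 196 patches, each flattened to 16·16·3 = 768 values.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches,
    each flattened into a single vector, ViT-style."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C)
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector: (num_patches, p*p*C)
    return grid.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img)
print(tokens.shape)  # (196, 768) — the "sentence" of patch tokens fed to the encoder
```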
This classifier uses google/vit-base-patch16-224 — pretrained on ImageNet-21k and fine-tuned on ImageNet-1k — via the Hugging Face Inference API. For each image, the API returns the top-5 predicted classes (with probability scores) out of the 1,000 ImageNet categories.
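A minimal stdlib-only sketch of the request this page makes, assuming the public Inference API endpoint and a token read from an `HF_TOKEN` environment variable (the variable name and the `top_labels` helper are illustrative, not part of the demo):

```python
import json
import os
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/google/vit-base-patch16-224"

def classify(image_path: str, token: str) -> list:
    """POST raw image bytes; the API responds with a JSON list of
    {"label": ..., "score": ...} entries for the top predicted classes."""
    with open(image_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        API_URL, data=body,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def top_labels(predictions: list, k: int = 5) -> list:
    """Extract the k highest-scoring label names from an API response."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["label"] for p in ranked[:k]]

if __name__ == "__main__":
    preds = classify("cat.jpg", os.environ["HF_TOKEN"])
    print(top_labels(preds))
```

A read-only token is enough here, since inference requests never write to the Hub.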