ViT Vision

// Vision Transformer · Image Classifier · google/vit-base-patch16-224

HuggingFace Inference API

Free key from huggingface.co/settings/tokens — read-only scope is sufficient.


Drop an image to classify

Drag & drop · paste · or choose a file

CLASSIFICATION RESULTS
Preview
Predictions vit-base-patch16-224

Try a sample

🐱
🐶
🦜
🚗
🌻

How it works

Vision Transformers (ViT) divide an image into a fixed grid of patches (16×16 pixels each), flatten them into vectors, and feed them to a standard Transformer encoder — just like words in a sentence. The model uses global self-attention from the very first layer, allowing any patch to directly attend to every other patch.
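The tokenization arithmetic above can be checked in a few lines. This is a sketch of the bookkeeping only (sizes taken from the model card below), not the model itself:

```python
P = 16          # patch side length in pixels
H = W = 224     # input resolution expected by vit-base-patch16-224
C = 3           # RGB channels

n_patches = (H // P) * (W // P)   # 14 * 14 = 196 patch tokens
patch_dim = P * P * C             # 768 raw values per flattened patch
seq_len   = n_patches + 1         # +1 for the learned [CLS] token

print(n_patches, patch_dim, seq_len)  # → 196 768 197
```

Each flattened patch is then linearly projected to the embedding dimension before entering the encoder; for this model the raw patch size (16×16×3 = 768) happens to equal the embedding dimension.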

This classifier uses google/vit-base-patch16-224 — pretrained on ImageNet-21k and fine-tuned on ImageNet-1k — via the HuggingFace Inference API. The model scores all 1,000 ImageNet-1k categories; the API returns the five highest-probability classes, which are shown as the predictions.
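A minimal stdlib-only sketch of that API call, assuming the standard Inference API endpoint shape (raw image bytes POSTed with a bearer token; `token` and `image_path` are placeholders you supply):

```python
import json
import urllib.request

MODEL = "google/vit-base-patch16-224"
API_URL = f"https://api-inference.huggingface.co/models/{MODEL}"

def classify(image_path: str, token: str):
    """POST raw image bytes; the API responds with a JSON list of
    {"label": ..., "score": ...} dicts for the top classes."""
    with open(image_path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        API_URL,
        data=data,
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

On a cold start the hosted model may return a 503 while it loads; retrying after a short delay is the usual workaround.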

Patch size: 16×16 · Input: 224×224 · Patches: 196 + 1 CLS · Embed dim: 768 · Heads: 12 · Layers: 12 · Params: 86M · Classes: 1,000
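The 86M figure above can be reproduced with back-of-envelope arithmetic. Layer structure follows the ViT paper; biases and layer norms are omitted here, so this counts weight matrices only:

```python
d, layers, mlp = 768, 12, 3072      # embed dim, encoder depth, MLP hidden dim

patch_embed = 16 * 16 * 3 * d       # linear projection of flattened patches
pos_embed   = (196 + 1) * d         # learned position embeddings (+CLS)
attn        = 4 * d * d             # Q, K, V and output projections per block
ffn         = 2 * d * mlp           # two MLP matrices per block
encoder     = layers * (attn + ffn)
head        = d * 1000              # ImageNet-1k classification head

total = patch_embed + pos_embed + encoder + head
print(f"{total / 1e6:.1f}M")        # → 86.4M
```

The exact checkpoint total differs by a fraction of a percent once biases and norm parameters are included, which is why the spec rounds to 86M.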