Vision Transformer

Intermediate

Transformer applied to image patches.


Why It Matters

The Vision Transformer marked a significant shift in computer vision, demonstrating that transformer architectures can be applied effectively to image data. Its success opened new avenues for research and applications in image classification, object detection, and beyond, and has influenced the design of subsequent models in the field.

The Vision Transformer (ViT) is an architecture that applies the principles of transformer models, originally designed for natural language processing, to image analysis. Rather than operating on raw pixels with convolutions, ViT divides an image into fixed-size patches (e.g., 16×16 pixels), flattens each patch, and linearly embeds it into a sequence of tokens. This sequence is processed with self-attention, allowing the model to capture global dependencies across the entire image. The architecture typically consists of multiple layers of multi-head self-attention and feed-forward networks, with positional encodings added to the tokens so the model can retain spatial information. ViT has been shown to rival or exceed traditional convolutional neural networks (CNNs) on various image classification benchmarks, particularly when pre-trained on large datasets.
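The patch-to-token pipeline described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the projection matrix and positional encodings are random placeholders standing in for learned parameters, and the function name `image_to_patch_tokens` is hypothetical.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size, embed_dim, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and
    linearly embed each flattened patch into a token vector.

    Sketch of the ViT input pipeline; in a real model the
    projection and positional encodings are learned, not random.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image dims must divide by patch size"

    # (H/p, p, W/p, p, C) -> (num_patches, p*p*C): one row per flattened patch.
    patches = (image.reshape(H // p, p, W // p, p, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, p * p * C))

    # Random projection standing in for the learned linear embedding.
    W_embed = rng.standard_normal((p * p * C, embed_dim)) * 0.02
    tokens = patches @ W_embed

    # Positional encodings (learned in practice) let the transformer
    # recover each patch's spatial location in the sequence.
    pos = rng.standard_normal((tokens.shape[0], embed_dim)) * 0.02
    return tokens + pos

# A 224x224 RGB image with 16x16 patches yields (224/16)^2 = 196 tokens.
img = np.zeros((224, 224, 3))
tokens = image_to_patch_tokens(img, patch_size=16, embed_dim=768)
print(tokens.shape)  # (196, 768)
```

The resulting token sequence is what the stack of self-attention and feed-forward layers then processes, exactly as a transformer would process a sentence of word embeddings.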

