Results for "contrastive vision-language"
A branch of ML using multi-layer neural networks to learn hierarchical representations, often excelling in vision, speech, and language.
Models that process or generate multiple modalities, enabling vision-language, speech, and video understanding tasks.
Joint vision-language model aligning images and text.
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Controlling robots through natural-language instructions.
Inputs crafted to cause model errors or unsafe behavior, often through imperceptible perturbations to images or subtle changes to text.
Devices that measure physical quantities (e.g., cameras, lidar, force sensors, IMUs).
External sensing of surroundings (vision, audio, lidar).
The field of building systems that perform tasks associated with human intelligence—perception, reasoning, language, planning, and decision-making—via algori...
Penalizes confident wrong predictions heavily; standard for classification and language modeling.
AI subfield dealing with understanding and generating human language, including syntax, semantics, and pragmatics.
AI focused on interpreting images/video: classification, detection, segmentation, tracking, and 3D understanding.
Transformer applied to image patches.
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
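As an illustration of the query itself, the entry on joint vision-language models that align images and text can be sketched with a contrastive objective. The snippet below is a minimal sketch, assuming a CLIP-style symmetric cross-entropy over temperature-scaled cosine similarities; the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are illustrative assumptions, not taken from any entry above.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is assumed to be a matching pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each image should pick out its own caption among all captions.
    loss_i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image among all images.
    loss_t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy embeddings (hypothetical): matched pairs point in nearly the same direction.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
# Same vectors with the captions swapped, so every pair is mismatched.
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)
```

The loss is small when each image is most similar to its own caption and large when the pairing is scrambled, which is the sense in which confident wrong predictions are penalized heavily.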