Results for "contrastive vision-language"
Joint vision-language model that aligns images and text in a shared embedding space, typically trained contrastively on matched image-caption pairs.
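Such contrastive alignment can be sketched as a symmetric InfoNCE-style loss over a batch of matched image/text embeddings; the function name and temperature value below are illustrative, not from any particular library.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched
    image/text embedding pairs (CLIP-style sketch)."""
    # L2-normalise so the similarity matrix holds cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarities
    labels = np.arange(len(logits))             # matched pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # cross-entropy in both directions: image->text and text->image
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
B, D = 4, 8
loss = clip_style_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)))
```

Minimising this loss pulls each image toward its own caption and away from the other captions in the batch.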
Models that process or generate multiple modalities, enabling vision-language tasks, speech, video understanding, etc.
AI focused on interpreting images/video: classification, detection, segmentation, tracking, and 3D understanding.
Probabilistic energy-based neural network with hidden variables.
A Boltzmann Machine restricted to a bipartite graph: connections run only between visible and hidden units, never within a layer, so each layer is conditionally independent given the other.
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
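The "assigns probabilities to sequences" idea can be shown at its simplest with a count-based bigram model, a minimal sketch of next-token prediction (the `<s>`/`</s>` boundary tokens are a common convention, assumed here).

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count-based bigram language model: estimates P(next | current)
    from token-pair counts in the training corpus."""
    counts = defaultdict(Counter)
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    # normalise counts into conditional probabilities
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

model = train_bigram(["the cat sat", "the dog sat"])
# model["the"] splits probability between "cat" and "dog"
```

Modern LLMs replace the count table with a neural network, but the training signal is the same: predict the next token given the context.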
External sensing of surroundings (vision, audio, lidar).
A branch of ML using multi-layer neural networks to learn hierarchical representations, often excelling in vision, speech, and language.
Transformer applied to image patches.
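The patch step can be sketched in a few lines: split the image into non-overlapping patches and flatten each one into a token vector (patch size 4 here is arbitrary).

```python
import numpy as np

def patchify(image, patch=4):
    """Split an (H, W, C) image into non-overlapping flattened patches,
    the token sequence a Vision Transformer operates on."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)              # group each patch's pixels
    return x.reshape(-1, patch * patch * C)     # (num_patches, patch_dim)

img = np.arange(8 * 8 * 3).reshape(8, 8, 3)
tokens = patchify(img, patch=4)                 # 4 patches of 4*4*3 = 48 values
```

Each row is then linearly projected and fed to a standard Transformer encoder, with a position embedding added per patch.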
Controlling robots via language.
AI subfield dealing with understanding and generating human language, including syntax, semantics, and pragmatics.
Identifying and localizing objects in images, often with confidence scores and bounding boxes.
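Detections are matched to ground truth by intersection-over-union (IoU) of their boxes; a minimal sketch with `(x1, y1, x2, y2)` corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, the
    standard overlap score for matching detections to ground truth."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # zero if boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 10, 10), (5, 5, 15, 15))     # 25 / 175
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.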
Assigning labels per pixel (semantic) or per instance (instance segmentation) to map object boundaries.
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
Reusing knowledge from a source task/domain to improve learning on a target task/domain, typically via pretrained models.
Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.
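Two of the listed transforms, flips and noise, can be sketched for images as follows (the 0.05 noise scale and 50% flip rate are arbitrary illustrative choices):

```python
import numpy as np

def augment(image, rng):
    """Apply random label-preserving transforms -- a horizontal flip
    and Gaussian pixel noise -- to one image in [0, 1]."""
    out = image.astype(float)
    if rng.random() < 0.5:
        out = out[:, ::-1]                              # horizontal flip
    out = out + rng.normal(scale=0.05, size=out.shape)  # pixel noise
    return np.clip(out, 0.0, 1.0)                       # stay in valid range

rng = np.random.default_rng(0)
batch = [augment(np.full((8, 8), 0.5), rng) for _ in range(4)]
```

Because the label is unchanged, each augmented copy is a free extra training example.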
Probabilistic graphical model for structured prediction.
AI limited to specific domains.
The set of tokens a model can represent; impacts efficiency, multilinguality, and handling of rare strings.
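A toy whitespace tokenizer makes the idea concrete: the vocabulary is the fixed set of strings the model can represent, and anything outside it falls back to an unknown token (the `<pad>`/`<unk>` special tokens are a common convention, assumed here).

```python
def build_vocab(texts, specials=("<pad>", "<unk>")):
    """Build a token-to-id map from whitespace-split training text."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    """Map text to ids; out-of-vocabulary tokens become <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

v = build_vocab(["to be or not to be"])
ids = encode("to be awesome", v)    # unseen "awesome" maps to the <unk> id
```

Real systems use subword schemes (BPE, WordPiece) instead of whole words, which shrinks the `<unk>` problem and changes how rare strings and other languages are spelled out in tokens.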
Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
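Constructing the masked training input can be sketched as follows (the 15% default rate follows common practice; the `[MASK]` string is the usual placeholder):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """BERT-style masking sketch: hide a fraction of tokens; the model
    is trained to predict the originals from context on both sides."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)       # only masked positions contribute to loss
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split(), mask_rate=0.5)
```

Because unmasked positions see tokens on both sides, the learned representations are bidirectional, which is why such encoders suit embeddings better than left-to-right generation.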
The text (and possibly other modalities) given to an LLM to condition its output behavior.
Crafting prompts to elicit desired behavior, often using role, structure, constraints, and examples.
Converting audio speech into text, often using encoder-decoder or transducer architectures.
Networks using convolution operations with weight sharing and locality, effective for images and signals.
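Weight sharing and locality are visible in a minimal valid-mode 2D convolution (implemented, as in most deep-learning libraries, as cross-correlation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: one shared kernel slides over
    the image, so every output uses the same few local weights."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output pixel depends only on a local kh x kw window
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge = conv2d(np.eye(5), np.array([[1.0, -1.0]]))   # horizontal difference filter
```

The same kernel applied everywhere means far fewer parameters than a dense layer and built-in translation equivariance.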
Inputs crafted to cause model errors or unsafe behavior, often imperceptible in vision or subtle in text.
Assigning category labels to images.
Pixel-level separation of individual object instances.
Pixel-wise classification of image regions.
Pixel motion estimation between frames.
Simultaneous Localization and Mapping for robotics.