Results for "text+image+audio"
Generating speech audio from text, with control over prosody, speaker identity, and style.
Detects trigger phrases in audio streams.
Generates audio waveforms from spectrograms.
Converting audio speech into text, often using encoder-decoder or transducer architectures.
Joint vision-language model aligning images and text.
Identifying speakers in audio.
Maps audio signals to linguistic units.
Aligns transcripts with audio timestamps.
Transformer applied to image patches.
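The patch step in this entry can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and non-overlapping square patches; the function name `image_to_patches` and the 4x4 toy image are made up for illustration. A real model would also apply a learned linear projection and add position embeddings, which are omitted here.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the image in raster order, one patch_size x patch_size tile at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x4 image whose pixel values are 0..15 in raster order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
print(patches[0])  # [0, 1, 4, 5]
```

Each flattened patch plays the role that a token plays in text: the resulting sequence of patch vectors is what the transformer layers would consume.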
Generating human-like speech from text.
Assigning labels per pixel (semantic segmentation) or per object instance (instance segmentation) to map object boundaries.
Assigning category labels to images.
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
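As a rough illustration of subword tokenization, the sketch below segments a word by greedy longest-match against a fixed vocabulary. This is a simplified stand-in for how some subword tokenizers segment text at inference time, not the algorithm of any particular library; `TOY_VOCAB` and the function name are hypothetical.

```python
def greedy_subword_tokenize(word, vocab, unk="<unk>"):
    """Segment a word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: treat the whole word as unknown.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical toy vocabulary, chosen only for this example.
TOY_VOCAB = {"un", "break", "able", "token", "ize", "r", "s"}

print(greedy_subword_tokenize("unbreakable", TOY_VOCAB))  # ['un', 'break', 'able']
print(greedy_subword_tokenize("tokenizers", TOY_VOCAB))   # ['token', 'ize', 'r', 's']
```

The trade-off named in the entry shows up directly: a small vocabulary still covers unseen words such as "tokenizers", at the cost of producing more tokens per word.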
Models that process or generate multiple modalities, enabling tasks such as vision-language reasoning, speech processing, and video understanding.
Pixel-wise classification of image regions.
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
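To make "assigns probabilities to sequences" concrete, here is a minimal sketch using a count-based bigram model, a much simpler stand-in for a neural language model. The sequence probability is the product of next-token probabilities (the chain rule). The two-sentence corpus and the `<s>` / `</s>` boundary markers are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sequence_probability(counts, sentence):
    """Multiply next-token probabilities along the sequence (chain rule)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total = sum(counts[prev].values())
        if total == 0:
            return 0.0  # previous token never seen in training
        prob *= counts[prev][nxt] / total
    return prob


# Hypothetical toy corpus.
counts = train_bigram(["the cat sat", "the dog sat"])
print(sequence_probability(counts, "the cat sat"))   # 0.5
print(sequence_probability(counts, "the bird sat"))  # 0.0
```

Training by next-token prediction amounts to fitting these conditional probabilities; a neural model replaces the count table with a learned function of the full preceding context.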
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
External sensing of surroundings (vision, audio, lidar).
Combining signals from multiple modalities.
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Recovering 3D structure from images.
Human or automated process of assigning target labels to data; label quality, consistency across annotators, and clear guidelines strongly affect downstream model performance.
Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.
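Two of the transformations named in this entry, flips and noise, can be sketched as follows. The image is assumed to be a list of rows of floats, and the function names, `flip_prob`, and `sigma` defaults are illustrative choices rather than values from any particular library.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D image left to right."""
    return [list(reversed(row)) for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every pixel."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


def augment(image, rng, flip_prob=0.5, sigma=0.05):
    """Randomly flip the image, then perturb it with noise; the input is not modified."""
    if rng.random() < flip_prob:
        out = horizontal_flip(image)
    else:
        out = [list(row) for row in image]
    return add_gaussian_noise(out, sigma, rng)


# Hypothetical 2x3 image; a seeded generator keeps the example reproducible.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
rng = random.Random(0)
augmented = augment(image, rng)
print(horizontal_flip(image))  # [[3.0, 2.0, 1.0], [6.0, 5.0, 4.0]]
```

Because each call draws fresh randomness, the same source image yields a different training example every epoch while its label stays unchanged.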
Models that learn to generate samples resembling training data.
Attention between different modalities.
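A minimal sketch of the idea, assuming scaled dot-product attention in which queries come from one modality (text, say) and keys and values come from another (image regions, say). The learned query, key, and value projections used in real models are omitted, and all vectors below are made-up toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """For each query, return a weighted average of values from the other modality."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(key dimension).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy inputs: two text queries attending over three image regions.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
image_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
outputs, weights = cross_attention(text_queries, image_keys, image_values)
```

In this toy setup the first text query puts most of its weight on the first image region and the second query on the second region, so each output row is dominated by the value vector of the region it matched.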
Generates sequences one token at a time, conditioning on past tokens.
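The generation loop can be sketched as below, using greedy decoding. The lookup table `TOY_TABLE` is a hypothetical stand-in for a trained model and, for brevity, looks only at the most recent token; a real model would condition on the entire prefix.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_TABLE = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.9, "</s>": 0.1},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(prefix):
    """Stand-in for a model call: returns P(next token | prefix)."""
    return TOY_TABLE[prefix[-1]]


def generate_greedy(max_new_tokens=10):
    """Emit one token at a time, feeding each choice back in as context."""
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = max(dist, key=dist.get)  # greedy: pick the most likely token
        if token == "</s>":
            break
        tokens.append(token)
    return tokens[1:]  # drop the start marker


print(generate_greedy())  # ['the', 'cat', 'sat']
```

Replacing the `max` step with sampling from `dist` gives stochastic decoding; the token-by-token loop itself stays the same.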
The text (and possibly other modalities) given to an LLM to condition its output behavior.
A continuous vector encoding of an item (word, image, user) such that semantic similarity corresponds to geometric closeness.
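Geometric closeness is commonly measured with cosine similarity. The sketch below uses hand-written 3-dimensional vectors purely for illustration; real embeddings are learned and typically have hundreds of dimensions, and the names `TOY_EMBEDDINGS` and `nearest` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical hand-made embeddings, not taken from any trained model.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def nearest(item, embeddings):
    """Return the other item whose embedding is most similar to item's."""
    others = [name for name in embeddings if name != item]
    return max(
        others, key=lambda name: cosine_similarity(embeddings[item], embeddings[name])
    )


print(nearest("cat", TOY_EMBEDDINGS))  # dog
```

Here "cat" and "dog" point in nearly the same direction, while "car" points elsewhere, which is the sense in which semantic similarity corresponds to closeness.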
Networks using convolution operations with weight sharing and locality, effective for images and signals.
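Weight sharing and locality can be seen in a minimal 1D sketch: the same small kernel slides over the signal, and each output depends only on a short window of inputs. As in most deep-learning usage, the operation below is technically cross-correlation (the kernel is not flipped). The signal and the two-tap edge-detecting kernel are made-up examples.

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal with no padding ("valid" mode)."""
    width = len(kernel)
    return [
        # Each output sees only `width` neighbouring inputs (locality),
        # and every position reuses the same kernel weights (weight sharing).
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


# Hypothetical step signal and a two-tap kernel that responds to changes.
signal = [0, 0, 1, 1, 1, 0, 0]
kernel = [-1, 1]
print(conv1d_valid(signal, kernel))  # [0, 1, 0, 0, -1, 0]
```

The output is nonzero only where the signal rises or falls, and the kernel has just two parameters regardless of the signal length, which is why convolutional layers need far fewer weights than fully connected ones on images and audio.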
A dataset and metric suite for comparing models; can be gamed or misaligned with real-world goals.