Results for "text+image+audio"
Generating speech audio from text, with control over prosody, speaker identity, and style.
Converting audio speech into text, often using encoder-decoder or transducer architectures.
A continuous vector encoding of an item (word, image, user) such that semantic similarity corresponds to geometric closeness.
Assigning a class label to every pixel in an image.
Transformer applied to image patches.
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
The text (and possibly other modalities) given to an LLM to condition its output behavior.
Inputs crafted to cause model errors or unsafe behavior, often through perturbations that are imperceptible in images or subtle in text.
Joint vision-language model aligning images and text.
Generating human-like speech from text.
Maps audio signals to linguistic units.
Aligns transcripts with audio timestamps.
Detects trigger phrases in audio streams.
Identifying speakers in audio.
Generates audio waveforms from spectrograms.
External sensing of surroundings (vision, audio, lidar).
Assigning category labels to images.
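
The embedding entry above describes semantic similarity as geometric closeness. A minimal sketch of that idea follows, assuming cosine similarity as the closeness measure (other measures, such as Euclidean distance, are also common). The three-dimensional vectors and the names `EMBEDDINGS`, `cosine_similarity`, and `nearest` are hypothetical and chosen by hand for illustration; real embeddings are learned and have far more dimensions.

```python
import math

# Hypothetical, hand-picked toy vectors; real embeddings are learned by a model.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest(query, table):
    """Return the key in table (other than query) most similar to query."""
    candidates = (key for key in table if key != query)
    return max(candidates, key=lambda key: cosine_similarity(table[query], table[key]))


if __name__ == "__main__":
    # With these toy vectors, "kitten" lies closer to "cat" than "car" does.
    print(nearest("cat", EMBEDDINGS))
```

With these toy vectors, the item nearest to "cat" is "kitten", since their directions almost coincide while "car" points along a different axis.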