Results for "text to audio"
Generating speech audio from text, with control over prosody, speaker identity, and style.
Converting spoken audio into text, often using encoder-decoder or transducer architectures.
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
The text (and possibly other modalities) given to an LLM to condition its output behavior.
Inputs crafted to cause model errors or unsafe behavior, often through perturbations that are imperceptible in images or subtle in text.
Joint vision-language model aligning images and text.
Generating human-like speech from text.
Maps audio signals to linguistic units.
Aligns transcripts with audio timestamps.
Detects trigger phrases in audio streams.
Identifying speakers in audio.
Generates audio waveforms from spectrograms.
External sensing of surroundings (vision, audio, lidar).