Results for "text to audio"
Generating speech audio from text, with control over prosody, speaker identity, and style.
Detects trigger phrases in audio streams.
Generates audio waveforms from spectrograms.
Converting spoken audio into text, often using encoder-decoder or transducer architectures.
Identifying speakers in audio.
Maps audio signals to linguistic units such as phonemes.
Aligns transcripts with audio timestamps.
Generating human-like speech from text.
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
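A minimal sketch of the text-to-token-ID step, assuming a toy whitespace tokenizer and a hypothetical hand-built vocabulary (real subword tokenizers such as BPE split more finely, but the mapping idea is the same):

```python
def tokenize(text, vocab, unk_id=0):
    """Split on whitespace and map each piece to an integer ID.

    Out-of-vocabulary words fall back to the <unk> ID; subword
    tokenizers avoid most of these fallbacks by splitting rare
    words into smaller known pieces.
    """
    return [vocab.get(token, unk_id) for token in text.lower().split()]

# Illustrative vocabulary, not from any real tokenizer.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

ids = tokenize("The cat sat down", vocab)
# "down" is out of vocabulary, so it maps to the <unk> ID 0.
```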
Joint vision-language model aligning images and text.
Models that process or generate multiple modalities, enabling vision-language tasks, speech, video understanding, etc.
A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.
External sensing of surroundings (vision, audio, lidar).
Combining signals from multiple modalities.
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Generates sequences one token at a time, conditioning on past tokens.
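The decoding loop can be sketched with a stand-in `next_token` rule (an assumption for illustration, not a real model) to show how each step conditions on everything generated so far:

```python
def next_token(context):
    # Illustrative stand-in for a model's prediction: the last token
    # plus one, wrapped around a tiny 5-token vocabulary.
    return (context[-1] + 1) % 5

def generate(prompt, steps):
    """Append one token at a time, conditioning on the full prefix."""
    sequence = list(prompt)
    for _ in range(steps):
        sequence.append(next_token(sequence))
    return sequence

generate([0], 3)  # -> [0, 1, 2, 3]
```

Swapping `next_token` for a trained model's argmax or sampled prediction gives greedy or stochastic decoding, respectively.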
The text (and possibly other modalities) given to an LLM to condition its output behavior.
Mechanism that computes context-aware mixtures of representations; scales well and captures long-range dependencies.
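Scaled dot-product attention, the core of this mechanism, can be written in a few lines of NumPy (a bare-bones sketch without multi-head projections or masking):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.

    Each output row is a context-aware weighted mixture of the
    value rows, with weights given by query-key similarity.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With a single key/value pair the softmax weight is 1, so the output equals that value row exactly.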
Training objective where the model predicts the next token given previous tokens (causal modeling).
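The training pairs for this objective come from a simple shift: the target at each position is the input sequence advanced by one token.

```python
def shift_targets(tokens):
    """Form causal LM training pairs: predict tokens[1:] from tokens[:-1]."""
    return tokens[:-1], tokens[1:]

inputs, targets = shift_targets([5, 9, 2, 7])
# inputs  = [5, 9, 2]  -> what the model sees
# targets = [9, 2, 7]  -> what it must predict, one step ahead
```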
Human or automated process of assigning targets; quality, consistency, and guidelines matter heavily.
Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.
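One of the simplest transformations, additive noise on numeric features, can be sketched as follows (the scale and seed are illustrative choices, not recommended defaults):

```python
import random

def add_noise(sample, scale=0.1, seed=0):
    """Return a perturbed copy of a numeric sample (Gaussian noise)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, scale) for x in sample]

original = [1.0, 2.0, 3.0]
noisy = add_noise(original)
# Same length, slightly perturbed values: a "new" training example
# that preserves the original's label.
```

Flips for images and paraphrases for text follow the same pattern: a label-preserving transform applied on the fly during training.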
Inputs crafted to cause model errors or unsafe behavior, often imperceptible in vision or subtle in text.
Attacks that manipulate model instructions (especially via retrieved content) to override system goals or exfiltrate data.
AI subfield dealing with understanding and generating human language, including syntax, semantics, and pragmatics.
Prevents attention to future tokens during training/inference.
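The causal mask itself is just a lower-triangular boolean matrix, as in this minimal NumPy sketch:

```python
import numpy as np

def causal_mask(n):
    """Boolean mask where True means attention is allowed.

    Position i may attend to positions j <= i, never to the future.
    """
    return np.tril(np.ones((n, n), dtype=bool))

mask = causal_mask(3)
# [[ True, False, False],
#  [ True,  True, False],
#  [ True,  True,  True]]
```

In practice, disallowed positions have their attention scores set to negative infinity before the softmax so their weights become zero.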
Stores past attention states to speed up autoregressive decoding.
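A bare-bones sketch of the caching idea, with illustrative names rather than any real framework's API: each decode step stores only its own key/value projection and reuses everything already cached.

```python
class KVCache:
    """Accumulates per-token keys and values across decode steps."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, new_key, new_value):
        """Cache the new token's key/value; return the full history."""
        self.keys.append(new_key)
        self.values.append(new_value)
        return self.keys, self.values

cache = KVCache()
for t in range(3):
    keys, values = cache.step(f"k{t}", f"v{t}")
# After 3 steps the cache holds every past key/value, so attention at
# step t never recomputes projections for tokens 0..t-1.
```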
Models that learn to generate samples resembling training data.
Attention in which queries from one modality attend to keys and values from another.
AI supporting legal research, drafting, and analysis.