Search: text+image+audio

Text-to-Speech Intermediate

Generating speech audio from text, with control over prosody, speaker identity, and style.

Speech & Audio AI

Wake Word Detection Intermediate

Detects trigger phrases in audio streams.

Speech & Audio AI

Neural Vocoder Intermediate

Generates audio waveforms from spectrograms.

Speech & Audio AI

Speech Recognition Intermediate

Converting audio speech into text, often using encoder-decoder or transducer architectures.

Speech & Audio AI

Speaker Diarization Intermediate

Partitioning an audio recording into segments by speaker, answering "who spoke when".

Speech & Audio AI

CLIP Intermediate

Joint vision-language model trained contrastively to align images and text in a shared embedding space.

Computer Vision

Acoustic Model Intermediate

Maps audio signals to linguistic units such as phonemes or characters.

Speech & Audio AI

Forced Alignment Intermediate

Aligns transcripts with audio timestamps.

Speech & Audio AI

Speech Synthesis Intermediate

Generating human-like speech from text.

Speech & Audio AI

Vision Transformer Intermediate

Transformer applied to image patches.

Computer Vision
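
To make "image patches" concrete, here is a minimal sketch of the patch-splitting step only, assuming a single-channel image stored as a list of rows. The function name `image_to_patches` and the 4x4 toy image are illustrative assumptions; a real Vision Transformer would also project each patch linearly and add position information before the transformer layers.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the grid in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x4 single-channel image with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)
print(patches[0])  # top-left patch, flattened
```

Each flattened patch plays the role that a token plays in a text transformer.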

Multimodal Model Intermediate

Models that process or generate multiple modalities, enabling vision-language tasks, speech, video understanding, etc.

Foundations & Theory

Tokenization Intermediate

Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.

Foundations & Theory
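
As a rough illustration of subword tokenization, the sketch below splits a word by greedy longest-match against a small vocabulary. The vocabulary `TOY_VOCAB` and the `<unk>` fallback are assumptions made for the example; production tokenizers learn their vocabularies from data and differ in detail.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of a word into known subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matches here: emit an unknown marker
            # and skip one character.
            tokens.append("<unk>")
            start += 1
    return tokens


# Hypothetical toy vocabulary, chosen only for this example.
TOY_VOCAB = {"un", "break", "able", "token", "ize", "r", "s"}

print(tokenize("unbreakable", TOY_VOCAB))
print(tokenize("tokenizers", TOY_VOCAB))
```

A small set of pieces can cover words never seen whole, which is the vocabulary-size versus coverage trade-off the definition refers to.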

Segmentation Intermediate

Assigning a label to every pixel, either by class (semantic segmentation) or by object instance (instance segmentation), to delineate object boundaries.

Computer Vision

Image Classification Intermediate

Assigning category labels to images.

Computer Vision

Exteroception Advanced

External sensing of surroundings (vision, audio, lidar).

Robotics & Embodied AI

Language Model Intermediate

A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.

Large Language Models
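
The idea of assigning a probability to a token sequence can be sketched with the chain rule: multiply the probability of each token given what came before. The example assumes a hand-written bigram table (`BIGRAM_PROBS`, invented numbers) in which each token depends only on the previous one; real language models condition on much longer contexts.

```python
import math

# Hypothetical conditional probabilities P(token | previous token).
BIGRAM_PROBS = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.1,
}


def sequence_log_prob(tokens, bigram_probs, start="<s>"):
    """Sum log P(token | previous) over the sequence (chain rule in log space)."""
    log_prob = 0.0
    previous = start
    for token in tokens:
        log_prob += math.log(bigram_probs[(previous, token)])
        previous = token
    return log_prob


log_p = sequence_log_prob(["the", "cat", "sat"], BIGRAM_PROBS)
print(math.exp(log_p))  # product of the three conditionals, 0.5 * 0.2 * 0.1
```

Working in log space avoids numerical underflow when sequences get long.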

Large Language Model Intermediate

A high-capacity language model trained on massive corpora, exhibiting broad generalization and emergent behaviors.

Large Language Models

Semantic Segmentation Intermediate

Pixel-wise classification of image regions.

Computer Vision

Multimodal Fusion Intermediate

Combining signals from multiple modalities (for example, text, image, and audio features) into a joint representation or prediction.

Computer Vision

Transformer Intermediate

Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.

Transformers & LLMs

Autoregressive Model Intermediate

Generates sequences one token at a time, conditioning on past tokens.

Foundations & Theory
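
The generation loop behind this definition can be sketched as follows. The transition table `TOY_TRANSITIONS` is a hypothetical stand-in for a trained model and looks only at the last token; greedy selection is one decoding choice among several (sampling and beam search are common alternatives).

```python
# Hypothetical next-token distributions, keyed by the most recent token.
TOY_TRANSITIONS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"</s>": 1.0},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(prefix):
    """Stand-in for a model call: returns P(next token | prefix)."""
    return TOY_TRANSITIONS[prefix[-1]]


def generate_greedy(max_new_tokens=10):
    """Append the most probable next token until the end marker appears."""
    sequence = ["<s>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(sequence)
        token = max(dist, key=dist.get)
        sequence.append(token)  # the new token becomes part of the context
        if token == "</s>":
            break
    return sequence


print(generate_greedy())
```

Each generated token is fed back in as context for the next step, which is what makes the process autoregressive.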

Prompt Intermediate

The text (and possibly other modalities) given to an LLM to condition its output behavior.

Prompting & Instructions

Data Labeling Intermediate

Human or automated process of assigning targets; quality, consistency, and guidelines matter heavily.

Foundations & Theory

Data Augmentation Intermediate

Expanding training data via transformations (flips, noise, paraphrases) to improve robustness.

Foundations & Theory
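
A minimal sketch of two of the transformations mentioned above (flips and noise), assuming an image is a list of rows of float pixel values. The function names, the flip probability, and the noise level are illustrative assumptions, not recommended settings.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D grid of pixel values left to right."""
    return [row[::-1] for row in image]


def add_noise(image, std, rng):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[pixel + rng.gauss(0.0, std) for pixel in row] for row in image]


def augment(image, rng, flip_prob=0.5, noise_std=0.05):
    """Return a randomly transformed copy; the label would stay unchanged."""
    out = [row[:] for row in image]  # copy so the original is untouched
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return add_noise(out, noise_std, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
augmented = augment(image, rng)
```

Calling `augment` repeatedly on the same image yields different training variants that share one label.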

Generative Model Advanced

Models that learn to generate samples resembling training data.

Diffusion & Generative Models

Cross-Attention Intermediate

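Attention in which queries come from one sequence and keys and values from another; often used to connect modalities, for example text attending to image features.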

Computer Vision

3D Reconstruction Intermediate

Recovering 3D structure from images.

Computer Vision

Attention Intermediate

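Mechanism that computes context-aware weighted mixtures of representations; it parallelizes well and captures long-range dependencies, though the cost of standard self-attention grows quadratically with sequence length.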

Transformers & LLMs
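
One common form of this mechanism is scaled dot-product attention, sketched below in plain Python for readability. The vectors are toy values, and the sketch omits the learned projections, masking, and multiple heads a real transformer layer would have.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted mixture of the value vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy inputs: two identical keys, so the weights come out uniform.
queries = [[1.0, 0.0]]
keys = [[1.0, 1.0], [1.0, 1.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
result = attention(queries, keys, values)
```

Passing the same sequence as queries, keys, and values gives self-attention; drawing the queries from a different sequence gives cross-attention.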

Next-Token Prediction Intermediate

Training objective where the model predicts the next token given previous tokens (causal modeling).

Foundations & Theory
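
The objective can be sketched in two steps: shift the token sequence by one to form input and target pairs, then score the predicted distributions with cross-entropy. The token IDs and the uniform "model output" below are made-up placeholders, assuming a vocabulary of five tokens.

```python
import math


def make_training_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the sequence shifted by one."""
    return token_ids[:-1], token_ids[1:]


def cross_entropy(predicted_dists, targets):
    """Mean negative log-probability assigned to each correct next token."""
    losses = [
        -math.log(dist[target]) for dist, target in zip(predicted_dists, targets)
    ]
    return sum(losses) / len(losses)


token_ids = [3, 1, 4, 1]
inputs, targets = make_training_pairs(token_ids)

# Hypothetical untrained model: uniform distribution over 5 tokens everywhere.
uniform = [0.2] * 5
loss = cross_entropy([uniform] * len(targets), targets)
```

A uniform predictor gives a loss of log(vocabulary size); training pushes the loss below that baseline by putting more probability on the observed next tokens.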

Adversarial Example Intermediate

Inputs crafted to cause model errors or unsafe behavior, often imperceptible in vision or subtle in text.

Foundations & Theory
