Recovering 3D structure from images.
Monte Carlo method for state estimation.
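The Monte Carlo state-estimation method described here is typically realized as a bootstrap particle filter. A minimal 1-D sketch (the random-walk motion model, noise levels, and particle count are illustrative assumptions, not a definitive implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(observations, n_particles=1000,
                    process_noise=0.5, obs_noise=1.0):
    """Bootstrap particle filter for a 1-D random-walk state."""
    particles = rng.normal(0.0, 1.0, n_particles)  # initial belief
    estimates = []
    for z in observations:
        # Predict: propagate each particle through the motion model.
        particles = particles + rng.normal(0.0, process_noise, n_particles)
        # Update: weight each particle by the observation likelihood.
        weights = np.exp(-0.5 * ((z - particles) / obs_noise) ** 2)
        weights /= weights.sum()
        # Resample: draw particles in proportion to their weights.
        idx = rng.choice(n_particles, n_particles, p=weights)
        particles = particles[idx]
        estimates.append(particles.mean())
    return estimates

# Track a hidden state that drifts upward under noisy observations.
true_states = np.cumsum(rng.normal(0.1, 0.3, 50))
obs = true_states + rng.normal(0.0, 1.0, 50)
est = particle_filter(obs)
```

The resampling step is what keeps the particle set concentrated on plausible states; without it, most weights collapse toward zero.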
Devices measuring physical quantities (vision, lidar, force, IMU, etc.).
Software pipeline converting raw sensor data into structured representations.
Artificial sensor data generated in simulation.
Perceived actions an environment allows.
Interpreting human gestures.
Penalizes confident wrong predictions heavily; standard for classification and language modeling.
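This is the cross-entropy loss. A minimal sketch of why confident wrong predictions are penalized so heavily (the probabilities are made up for illustration):

```python
import math

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

# True class is 0. Compare a confidently wrong model to a hedging one:
confident_wrong = [0.01, 0.99]   # puts 1% on the right answer
uncertain       = [0.40, 0.60]   # hedges between the two

print(cross_entropy(confident_wrong, 0))  # ≈ 4.61
print(cross_entropy(uncertain, 0))        # ≈ 0.92
```

The loss grows without bound as the probability on the true class approaches zero, which is exactly the penalty on confident errors.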
Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Letting an LLM call external functions/APIs to fetch data, compute, or take actions, improving reliability.
Model-generated content that is fluent but unsupported by evidence or incorrect; mitigated by grounding and verification.
Fine-tuning on (prompt, response) pairs to align a model with instruction-following behaviors.
Automated detection/prevention of disallowed outputs (toxicity, self-harm, illegal instructions, etc.).
Exponential of average negative log-likelihood; lower means better predictive fit, not necessarily better utility.
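The relationship to average negative log-likelihood can be sketched directly (the per-token probabilities here are invented, not from any real model):

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-likelihood
    of the probabilities assigned to the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Probabilities a model assigned to each actual next token.
good_fit = [0.5, 0.6, 0.4, 0.7]
poor_fit = [0.1, 0.2, 0.05, 0.1]

print(perplexity(good_fit))  # ≈ 1.86  (lower: better predictive fit)
print(perplexity(poor_fit))  # = 10.0
```

Equivalently, perplexity is the reciprocal of the geometric mean of the assigned probabilities, which is why the second sequence (geometric mean 0.1) comes out at exactly 10.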
Models trained to decide when to call tools.
Aligns transcripts with audio timestamps.
Temporal and pitch characteristics of speech.
Task instruction without examples.
Assigning a role or identity to the model.
AI supporting legal research, drafting, and analysis.
The field of building systems that perform tasks associated with human intelligence—perception, reasoning, language, planning, and decision-making—via algorithms.
Updating a pretrained model’s weights on task-specific data to improve performance or adapt style/behavior.
Using markers to isolate context segments.
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
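A minimal single-head self-attention sketch in NumPy; the random projection weights are an illustrative assumption, and masking and multi-head machinery are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, d_k=8):
    """Single-head self-attention: Q, K, V all derive from the same sequence x."""
    d_model = x.shape[-1]
    W_q = rng.normal(size=(d_model, d_k))
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # each token: weighted mix of values

x = rng.normal(size=(5, 16))   # 5 tokens, 16-dim embeddings
out = self_attention(x)
print(out.shape)  # (5, 8)
```

Because queries, keys, and values all come from `x`, every output row is a mixture of information from every token in the same sequence.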
Networks with recurrent connections for sequences; largely supplanted by Transformers for many tasks.
Injects sequence order into Transformers, since attention alone is permutation-invariant.
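The fixed sinusoidal scheme from the original Transformer paper is one common choice; a sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sin/cos position encodings ('Attention Is All You Need')."""
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]    # even embedding dimensions
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dims get sine
    pe[:, 1::2] = np.cos(angles)                # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
# Added to token embeddings so attention can distinguish positions.
print(pe.shape)  # (50, 16)
```

Learned positional embeddings are an equally common alternative; either way, some position signal must be injected, since attention by itself treats the input as an unordered set.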
An RNN variant using gates to mitigate vanishing gradients and capture longer context.
Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
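A toy greedy longest-match tokenizer illustrates the subword idea (the tiny hand-made vocabulary is an assumption; real tokenizers such as BPE or WordPiece learn their vocabularies from data):

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization
    (a toy stand-in for BPE/WordPiece)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append("<unk>")  # no vocabulary entry covers this character
            i += 1
    return tokens

vocab = {"token", "ization", "un", "break", "able", "s", " "}
print(tokenize("tokenization", vocab))   # ['token', 'ization']
print(tokenize("unbreakables", vocab))   # ['un', 'break', 'able', 's']
```

A larger vocabulary means fewer tokens per text but a bigger embedding table; subword schemes sit between character-level and word-level extremes to balance the two.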
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Training objective where the model predicts the next token given previous tokens (causal modeling).
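The causal objective amounts to shifting the sequence by one position: each token's training target is the token that follows it. A sketch (the token IDs are arbitrary):

```python
def next_token_pairs(token_ids):
    """Build (input, target) pairs for causal language modeling:
    at each position t, the model conditions on tokens[:t+1]
    and is trained to predict tokens[t+1]."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return list(zip(inputs, targets))

seq = [101, 7, 42, 9, 102]  # arbitrary token IDs
print(next_token_pairs(seq))
# [(101, 7), (7, 42), (42, 9), (9, 102)]
```

In a Transformer this shift is combined with a causal attention mask, so every position is trained in parallel while only attending to earlier tokens.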