Results for "attention weights"
A single attention mechanism within multi-head attention.
Attention mechanisms that reduce quadratic complexity.
Mechanism that computes context-aware mixtures of representations; scales well and captures long-range dependencies.
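A minimal NumPy sketch of the context-aware mixture this entry describes (scaled dot-product attention; all names and shapes here are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Illustrative sketch: output rows are attention-weighted mixtures of V."""
    d_k = Q.shape[-1]
    # Query-key similarity scores, scaled to keep the softmax well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys yields the attention weights (each row sums to 1).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: context-aware mixture of the value vectors.
    return weights @ V, weights

Q = np.random.randn(4, 8)  # 4 query positions, dimension 8
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out, w = scaled_dot_product_attention(Q, K, V)
```

Every output position mixes information from every input position, which is how long-range dependencies are captured in a single step.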
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
GNN using attention to weight neighbor contributions dynamically.
Attention where queries come from one modality and keys/values from another (e.g., text tokens attending to image features).
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Models whose weights are publicly available.
Allows the model to attend to information from different representation subspaces simultaneously.
Prevents attention to future tokens during training/inference.
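This masking can be sketched in NumPy: disallowed (future) positions get a score of negative infinity, so the softmax assigns them exactly zero weight (a minimal sketch; names are illustrative):

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular mask: position i may attend only to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Masked-out entries become -inf, so exp(-inf) = 0 attention weight.
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(5, 5)
w = masked_softmax(scores, causal_mask(5))
```

The strictly upper-triangular part of `w` is all zeros: no token attends to its future.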
Methods to set starting weights to preserve signal/gradient scales across layers.
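Two standard schemes can be sketched as follows (a minimal sketch; the seed and layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Glorot/Xavier: variance ~ 2 / (fan_in + fan_out); keeps activation
    # and gradient scales roughly constant for tanh-like units.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He/Kaiming: variance ~ 2 / fan_in; tuned for ReLU activations.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(512, 512)
```

The scale depends on layer width precisely so that signal variance neither explodes nor vanishes as depth grows.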
Removing weights or neurons to shrink models and improve efficiency; can be structured or unstructured.
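Unstructured magnitude pruning, the simplest variant, can be sketched like this (illustrative names; structured pruning would instead remove whole rows, columns, or neurons):

```python
import numpy as np

def magnitude_prune(W, sparsity=0.5):
    """Zero out the smallest-|magnitude| weights until `sparsity`
    fraction of entries are removed (unstructured pruning sketch)."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

W = np.random.randn(8, 8)
Wp = magnitude_prune(W, sparsity=0.75)
```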
Techniques to handle longer documents without quadratic cost.
Using delimiter markers (e.g., special tokens or tags) to separate context segments so the model treats them as distinct.
Techniques that fine-tune small additional components rather than all weights to reduce compute and storage.
Reducing numeric precision of weights/activations to speed inference and reduce memory with acceptable accuracy loss.
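Symmetric per-tensor int8 quantization, one of the simplest schemes, can be sketched as (names are illustrative; production schemes are often per-channel and calibrated):

```python
import numpy as np

def quantize_int8(W):
    """Map float weights into int8 with a single shared scale factor."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction; error is bounded by scale / 2 per entry.
    return q.astype(np.float32) * scale

W = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)
```

Storage drops 4x (int8 vs. float32) while the worst-case per-weight error stays within half a quantization step.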
Using same parameters across different parts of a model.
Models accessible only via service APIs.
Updating a pretrained model’s weights on task-specific data to improve performance or adapt style/behavior.
Injects sequence order into Transformers, since attention alone is permutation-invariant.
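The sinusoidal scheme from the original Transformer is one way to inject order; a minimal sketch (sizes are illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Even dims use sin, odd dims use cos, with geometrically spaced
    frequencies, so each position gets a unique pattern."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to (or concatenated with) token embeddings

pe = sinusoidal_positions(16, 32)
```

Learned position embeddings and rotary embeddings (RoPE) are common alternatives serving the same purpose.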
Studying internal mechanisms or input influence on outputs (e.g., saliency maps, SHAP, attention analysis).
Caches past key/value projections so autoregressive decoding only computes attention inputs for the newest token.
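The caching pattern can be sketched in NumPy: at each decoding step, only the new token is projected, and its key/value rows are appended to the cache (a minimal single-head sketch; names are illustrative):

```python
import numpy as np

def attend(q, K, V):
    # One query attending over all cached keys/values.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(5):
    x = np.random.randn(d)                  # current token's hidden state
    # Project only the NEW token; past projections are reused from the cache.
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    out = attend(x, K_cache, V_cache)
```

Without the cache, every step would re-project all previous tokens, making decoding quadratic in practice rather than linear per step.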
Transformer applied to image patches.
Reusing knowledge from a source task/domain to improve learning on a target task/domain, typically via pretrained models.
A parameterized mapping from inputs to outputs; includes architecture + learned parameters.
The learned numeric values of a model adjusted during training to minimize a loss function.
Controls the size of parameter updates; too high causes divergence, too low trains slowly or gets stuck.
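The divergence/slowness trade-off shows up even on a one-dimensional quadratic (a toy sketch; the function and rates are illustrative):

```python
def gradient_descent(lr, steps=50, w=0.0):
    # Minimize f(w) = (w - 3)^2; the gradient is 2 * (w - 3).
    # Each step multiplies the error (w - 3) by the factor (1 - 2 * lr),
    # so |1 - 2 * lr| > 1 means the iterates diverge.
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

good = gradient_descent(lr=0.1)    # converges close to the minimum at 3
slow = gradient_descent(lr=0.001)  # moving, but still far from 3 after 50 steps
bad = gradient_descent(lr=1.1)     # overshoots more each step and blows up
```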
A parameterized function composed of interconnected units organized in layers with nonlinear activations.
Gradients grow too large, causing divergence; mitigated by gradient clipping, normalization, and careful initialization.
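Clipping by global norm, the most common mitigation, can be sketched as follows (illustrative names; frameworks provide built-in equivalents):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradient tensors together so their combined L2 norm
    does not exceed max_norm; direction is preserved, magnitude capped."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

grads = [np.full((3, 3), 10.0), np.full((3,), 10.0)]  # huge gradients
clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping the global norm (rather than each tensor separately) keeps the relative scale of the per-layer gradients intact.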
Logging hyperparameters, code versions, data snapshots, and results to reproduce and compare experiments.