Results for "max tokens"
Masked language modeling: Predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
ReLU (rectified linear unit): The activation max(0, x); improves gradient flow and training speed in deep networks.
Leakage detection: Detecting unauthorized model outputs or data leaks.
Top-k sampling: Samples only from the k highest-probability tokens, truncating the tail of the distribution to limit unlikely outputs.
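A minimal NumPy sketch of the top-k rule above; `top_k_sample` and its toy logits are illustrative, not any particular library's API:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample an index from the k highest-probability tokens only."""
    top = np.argsort(logits)[-k:]             # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                      # softmax restricted to the top k
    return top[rng.choice(len(top), p=probs)]

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1, -1.0])
token = top_k_sample(logits, k=2, rng=rng)    # always index 0 or 1 here
```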
Language model: A model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
Nucleus (top-p) sampling: Samples from the smallest set of tokens whose cumulative probability reaches p, adapting the set size to context.
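A NumPy sketch of the nucleus rule above; the function name and toy inputs are hypothetical:

```python
import numpy as np

def top_p_sample(logits, p, rng):
    """Sample from the smallest prefix of tokens whose cumulative probability >= p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1      # smallest prefix with cum >= p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()    # renormalize over the nucleus
    return keep[rng.choice(len(keep), p=kept)]
```

Unlike top-k, the number of candidate tokens grows or shrinks with how peaked the distribution is.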
Positional encoding: Injects sequence order into Transformers, since attention alone is permutation-invariant.
Next-token prediction: Training objective in which the model predicts the next token given the previous tokens (causal language modeling).
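The objective above amounts to cross-entropy between each position's predicted distribution and the actual next token; a toy sketch with a hypothetical `next_token_loss` over toy logits:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of predicting each target token from its logits row."""
    z = logits - logits.max(axis=-1, keepdims=True)        # stabilized log-softmax
    logprobs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logprobs[np.arange(len(targets)), targets].mean()

# Logits sharply peaked on the correct tokens give a loss near zero.
loss = next_token_loss(np.eye(3) * 10.0, [0, 1, 2])
```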
Autoregressive generation: Generates sequences one token at a time, conditioning each token on those already generated.
Context window (max tokens): Maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Throughput: How many requests or tokens can be processed per unit time; affects scalability and cost.
Causal mask: Prevents attention to future tokens during training and inference.
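The mask above is just a lower-triangular boolean matrix; a minimal sketch (toy shapes, illustrative names):

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Typical use: set masked-out attention scores to -inf before the softmax.
scores = np.zeros((3, 3))
scores[~causal_mask(3)] = -np.inf
```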
Rotary positional embedding (RoPE): Encodes positional information by rotating query and key vectors in embedding space.
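A sketch of the rotation idea above in the common "rotate-half" formulation, assuming an even feature dimension; the function is illustrative, not a specific library's implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each 2-D feature pair of x by an angle that grows with position."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)             # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * freqs[None, :] # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because only relative angles matter in dot products, query-key similarity depends on relative position; position 0 is left unrotated and vector norms are preserved.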
Absolute positional encoding: Encodes each token's position explicitly, often via sinusoids.
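The sinusoidal variant can be sketched in a few lines of NumPy (toy shapes; follows the interleaved sin/cos scheme of the original Transformer paper):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / (10000 ** (i / d_model))      # wavelengths from 2*pi to ~10000*2*pi
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```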
Efficient attention: Attention variants (e.g., sparse or linear attention) that reduce the quadratic complexity of full self-attention.
Saddle point: A point where the gradient is zero but that is neither a maximum nor a minimum; common in deep nets.
Activation functions: Nonlinear functions enabling networks to approximate complex mappings; ReLU variants dominate modern DL.
Message passing: GNN framework in which nodes iteratively exchange and aggregate messages from their neighbors.
Self-attention: Mechanism that computes context-aware mixtures of representations; parallelizes well and captures long-range dependencies.
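A single-head sketch of the mechanism above (scaled dot-product attention; toy weight matrices, illustrative names):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])              # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ v                                   # mixture of value vectors
```

Each output row is a weighted average of all value vectors, which is what makes every position's representation context-aware.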
Tokenization: Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
Vocabulary: The set of tokens a model can represent; impacts efficiency, multilinguality, and handling of rare strings.
Stochastic decoding: Generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
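The temperature knob mentioned above simply rescales logits before the softmax; a minimal sketch with hypothetical names:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by temperature: < 1 sharpens the distribution, > 1 flattens it."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

At very low temperature this approaches greedy argmax decoding; at high temperature it approaches uniform sampling.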
Long-context methods: Techniques for handling longer documents without quadratic cost, such as sliding-window attention or retrieval.
Rate limiting: Restricting how many requests or tokens a client may consume per time window; protects capacity and controls cost.
Transformer: Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Beam search: Decoding algorithm that keeps the highest-scoring partial sequences (the beam) at each step; can improve likelihood but reduce diversity.
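A toy sketch of the procedure above; `step_logprobs` is a hypothetical callback that returns next-token log-probabilities for a given prefix:

```python
import numpy as np

def beam_search(step_logprobs, beam_width, length):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [((), 0.0)]                       # (sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(step_logprobs(seq)):
                candidates.append((seq + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]       # prune everything below the beam
    return beams
```

With beam_width = 1 this degenerates to greedy decoding; wider beams explore more alternatives at higher cost.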
KV cache: Stores past attention keys and values so autoregressive decoding need not recompute them at every step.
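A toy single-head sketch of the idea above: each step appends one key/value pair, and the new query attends over everything cached so far (illustrative class, not a real framework's cache):

```python
import numpy as np

class KVCache:
    """Append each new token's key/value instead of recomputing past states."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        K = np.stack(self.keys)               # (t, d): all keys seen so far
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(len(q))
        w = np.exp(scores - scores.max())
        w /= w.sum()                          # softmax over cached positions
        return w @ V
```

This turns per-token decoding cost from recomputing the whole prefix into a single attention pass over cached states.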
Vision Transformer (ViT): A Transformer applied to sequences of image patches.
Inference cost: The cost of running models in production; driven by model size, context length, and request volume.