Results for "tokens/sec"
Masked language modeling: predicts masked tokens in a sequence, enabling bidirectional context; often used for embeddings rather than generation.
Leakage detection: detecting unauthorized model outputs or data leaks.
Language model: a model that assigns probabilities to sequences of tokens; often trained by next-token prediction.
Top-k sampling: samples from the k highest-probability tokens to limit unlikely outputs.
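The top-k strategy can be sketched in a few lines of pure Python; the function name and signature here are illustrative, not from any particular library:

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-probability entries of `logits`."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over just that set (subtract max for numerical stability).
    m = max(logits)
    weights = [math.exp(logits[i] - m) for i in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(top, weights=probs)[0]
```

With k=1 this degenerates to greedy (argmax) decoding; larger k admits more diversity at the risk of lower-probability continuations.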
Top-p (nucleus) sampling: samples from the smallest set of tokens whose probabilities sum to p, adapting the set size to context.
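A minimal nucleus-sampling sketch, again in plain Python with an illustrative function name:

```python
import math
import random

def top_p_sample(logits, p, rng=random):
    """Sample from the smallest set of tokens whose probability mass reaches p."""
    # Full softmax over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Accumulate tokens in descending probability order until mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights)[0]
```

Unlike top-k, the candidate set shrinks when the distribution is peaked and grows when it is flat, which is the adaptivity the definition refers to.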
Positional encoding: injects sequence order into Transformers, since attention alone is permutation-invariant.
Next-token prediction: training objective where the model predicts the next token given the previous tokens (causal language modeling).
Autoregressive generation: generates sequences one token at a time, conditioning on past tokens.
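The decoding loop itself is simple; the model below is a hypothetical stand-in (any callable mapping a prefix to the next token would do):

```python
def generate(next_token, prompt, max_new_tokens):
    """Autoregressive decoding: repeatedly append the model's next token."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(seq)   # the model conditions on everything so far
        seq.append(tok)
    return seq

# Toy "model" for illustration: emit the last token plus one.
demo = generate(lambda seq: seq[-1] + 1, [0], 3)
```

In a real system `next_token` would run a forward pass and a sampling step (greedy, top-k, top-p, etc.) over the model's logits.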
Context window: the maximum number of tokens the model can attend to in one forward pass; constrains long-document reasoning.
Throughput: how many requests or tokens can be processed per unit time (e.g., tokens/sec); affects scalability and cost.
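The tokens/sec metric in the query is just generated tokens divided by wall-clock time; a sketch with illustrative names:

```python
def throughput_tokens_per_sec(num_tokens, elapsed_sec):
    """tokens/sec: tokens processed divided by wall-clock time."""
    return num_tokens / elapsed_sec

# A batch of 4 sequences, 300 tokens each, decoded in 2 seconds:
rate = throughput_tokens_per_sec(4 * 300, 2.0)
```

Batched serving raises aggregate tokens/sec even when per-request latency stays flat, which is why the definition ties throughput to both scalability and cost.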
Causal mask: prevents attention to future tokens during training and inference.
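The mask is just a lower-triangular pattern over (query, key) positions; a minimal sketch:

```python
def causal_mask(n):
    """n x n mask: True where query position i may attend to key position j.

    Position i may only see positions j <= i, never the future.
    """
    return [[j <= i for j in range(n)] for i in range(n)]
```

In practice the False entries are implemented by adding a large negative value to the attention scores before the softmax, so masked positions get near-zero weight.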
Rotary position embeddings (RoPE): encode positional information via rotation in embedding space.
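A minimal sketch of the rotation, assuming an even-dimensional vector and the conventional base of 10000; consecutive pairs of coordinates are rotated by a position-dependent angle:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of `vec` by angles that grow with `pos`."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # frequency falls with dimension index
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out
```

Because each pair is rotated rather than translated, vector norms are preserved, and the dot product between a rotated query and key depends on the *relative* offset between their positions.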
Absolute (sinusoidal) positional encoding: encodes token position explicitly, often via sinusoids.
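The classic sinusoidal scheme interleaves sines and cosines at geometrically spaced frequencies; a sketch for a single position (assuming the usual 10000 base):

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal positional encoding vector for one token position."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 10000.0 ** (-i / d_model)   # lower frequency at higher dims
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe[:d_model]
```

The multi-scale frequencies let the model distinguish both nearby and distant positions, and the encoding needs no learned parameters.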
Efficient attention: attention variants that reduce the quadratic complexity of full self-attention.
Attention: mechanism that computes context-aware mixtures of representations; scales well and captures long-range dependencies.
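Scaled dot-product attention, the core computation, fits in a short pure-Python sketch (lists of lists stand in for matrices; a real implementation would use tensor operations):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention: each query mixes values by key similarity."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Softmax over key positions.
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        # Weighted mixture of value vectors.
        out.append([sum(wi * val[j] for wi, val in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

When all keys look alike the weights are uniform and the output is the mean of the values; sharper similarities concentrate the mixture on a few positions.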
Tokenization: converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
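A greedy longest-match tokenizer over a fixed subword vocabulary illustrates the idea (real tokenizers such as BPE learn their vocabularies from data; the vocabulary below is made up for the example):

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest piece first
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])          # unknown text falls back to one char
            i += 1
    return tokens
```

The single-character fallback is what gives subword schemes full coverage: any string can be tokenized even when no long piece matches.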
Vocabulary: the set of tokens a model can represent; impacts efficiency, multilinguality, and handling of rare strings.
Sampling: stochastic generation strategies that trade determinism for diversity; key knobs include temperature and nucleus (top-p) sampling.
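The temperature knob divides the logits before the softmax; a sketch with an illustrative function name:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Divide logits by temperature before softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    w = [math.exp(x - m) for x in scaled]
    z = sum(w)
    return rng.choices(range(len(logits)), weights=[x / z for x in w])[0]
```

As T approaches 0 the distribution collapses onto the argmax (near-greedy decoding); large T pushes it toward uniform, maximizing diversity at the cost of coherence.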
Long-context methods: techniques to handle longer documents without quadratic cost.
Rate limiting: restricting inference usage per client or per unit time to protect capacity and control cost.
Transformer: architecture based on self-attention and feedforward layers; the foundation of modern LLMs and many multimodal models.
Beam search: a search algorithm for generation that keeps the top-k partial sequences; can improve likelihood but reduce diversity.
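A toy beam search over a stand-in model (the `step_logprobs` callable and the two-token distribution below are hypothetical, for illustration only):

```python
import math

def beam_search(step_logprobs, start, width, length):
    """Keep the `width` best partial sequences at each step; return the best."""
    beams = [([start], 0.0)]                 # (token sequence, total log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            # Extend every beam with every possible next token.
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Prune back down to the `width` highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]
    return beams[0][0]

# Hypothetical toy model: from any prefix, token 1 is likelier than token 0.
best = beam_search(lambda seq: {0: math.log(0.4), 1: math.log(0.6)}, 2, 2, 3)
```

Because every beam tracks cumulative log-probability, the search can recover sequences a greedy decoder would miss, but the pruning step is also why outputs tend to cluster around a few high-likelihood continuations.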
KV cache: stores past attention keys and values to speed up autoregressive decoding.
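A minimal sketch of the idea, assuming single-head attention over plain Python lists: each decoding step appends one key/value pair and attends over everything cached so far, instead of recomputing all past keys and values.

```python
import math

class KVCache:
    """Append-only store of past keys/values for incremental decoding."""

    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, q, k, v):
        """Cache (k, v) for the new token, then attend q over all cached pairs."""
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]
```

This turns each decoding step from O(n) recomputation of keys/values into a single projection plus a lookup, at the cost of memory that grows linearly with sequence length.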
Vision Transformer (ViT): a Transformer applied to image patches.
Inference cost: the cost to run models in production.