Results for "inference acceleration"
Kinematics: Study of motion without reference to the forces that cause it.
Private inference: Methods that protect the model and data during inference (e.g., trusted execution environments) from operators or attackers.
Inference pipeline: The model execution path in production, from input preprocessing to prediction output.
Inference cost: The cost of running models in production.
Causal inference: Framework for reasoning about cause-effect relationships beyond correlation, often using structural assumptions and experiments.
Online inference: Low-latency prediction served per request in real time.
Batch inference: Running predictions over large datasets periodically rather than per request.
Training-serving skew: Differences between training and inference conditions that degrade production performance.
Active inference: Framework in which an agent acts to minimize surprise, formalized as free energy.
Compute: Hardware resources used for training and inference; constrained by memory bandwidth, FLOPs, and parallelism.
On-device inference: Running models locally on the end user's hardware rather than on remote servers.
Latency: Time from request to response; critical for real-time inference and user experience.
Confounder: A hidden variable that influences both cause and effect, biasing naive estimates of causal impact.
Quantization: Reducing the numeric precision of weights and activations to speed up inference and reduce memory, with acceptable accuracy loss.
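As an illustration of the entry above, a minimal symmetric int8 scheme, assuming NumPy (a toy per-tensor sketch; function names are hypothetical, not any library's API):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scheme: one scale maps floats onto [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; per-element error is at most scale / 2.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

Real deployments typically use per-channel scales and calibration data; the idea (float -> integer grid -> approximate float) is the same.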
Bayesian inference: Updating beliefs about parameters by combining observed evidence with prior distributions.
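The update has a closed form in conjugate cases; a sketch for a Beta prior over a Bernoulli parameter (names illustrative):

```python
def beta_bernoulli_update(alpha, beta, observations):
    # Conjugate Bayesian update: a Beta(alpha, beta) prior over a coin's
    # bias, combined with 0/1 observations, stays a Beta distribution.
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

# Uniform prior Beta(1, 1), then observe three heads and one tail.
a, b = beta_bernoulli_update(1, 1, [1, 1, 1, 0])
posterior_mean = a / (a + b)  # 4 / 6
```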
Causal mask: Prevents attention to future tokens during training and autoregressive inference.
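A minimal sketch of such a mask, assuming NumPy (illustrative, not tied to any framework):

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular boolean mask: position i may attend to j <= i only.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf before softmax, so their weight is 0.
    scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

T = 4
weights = masked_softmax(np.zeros((T, T)), causal_mask(T))
```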
Variational autoencoder (VAE): An autoencoder with probabilistic latent variables and KL regularization.
Instrumental variable: A variable that enables causal inference despite confounding by affecting the outcome only through the treatment.
Likelihood: The probability of the observed data given the model parameters.
Rate limiting: Capping inference usage per client to protect capacity and control cost.
Tokenization: Converting text into discrete units (tokens) for modeling; subword tokenizers balance vocabulary size and coverage.
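A toy sketch of greedy longest-match subword tokenization (illustrative only; real tokenizers learn their vocabularies, e.g. via BPE or unigram models):

```python
def greedy_tokenize(text, vocab):
    # Greedily take the longest vocabulary piece at each position.
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            tokens.append("<unk>")  # no vocab piece matched this character
            i += 1
    return tokens

vocab = {"token", "ize", "r", "s"}
tokens = greedy_tokenize("tokenizers", vocab)  # ["token", "ize", "r", "s"]
```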
A/B testing: A controlled experiment comparing variants by random assignment to estimate the causal effect of changes.
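One common way to compare variants is a pooled two-proportion z-test; a sketch with illustrative numbers (names are not any library's API):

```python
import math

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    # Pooled two-proportion z-statistic; |z| > 1.96 is significant at ~5%.
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 10.0% vs 13.0% conversion over 1000 users each.
z = two_proportion_z(100, 1000, 130, 1000)
```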
Active learning: Selecting the most informative samples to label (e.g., uncertainty sampling) to reduce labeling cost.
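Uncertainty sampling can be sketched as least-confidence selection (function names are hypothetical):

```python
def least_confidence(probs):
    # Uncertainty score: 1 - max class probability; higher = more uncertain.
    return [1.0 - max(p) for p in probs]

def select_to_label(probs, k):
    # Pick the k unlabeled examples the model is least confident about.
    scores = least_confidence(probs)
    return sorted(range(len(probs)), key=lambda i: scores[i], reverse=True)[:k]

pool = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]]
chosen = select_to_label(pool, 1)  # [1]: the near-50/50 prediction
```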
Observability: The ability to infer a system's internal state from its telemetry; crucial for operating AI services and agents.
Throughput: How many requests or tokens can be processed per unit time; affects scalability and cost.
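Throughput and latency are linked by Little's law (L = lambda * W): sustained in-flight requests equal arrival rate times time in system. A quick capacity sketch (numbers are hypothetical):

```python
def required_concurrency(throughput_rps, latency_s):
    # Little's law: in-flight requests = arrival rate x time in system.
    return throughput_rps * latency_s

# A hypothetical service at 200 requests/s with 150 ms average latency
# must sustain about 30 requests in flight.
in_flight = required_concurrency(200, 0.150)
```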
Membership inference attack: An attack that infers whether specific records were in the training data; related attacks reconstruct sensitive training examples.
Human-in-the-loop: System design in which humans validate or guide model outputs, especially for high-stakes decisions.
KL divergence: Measures how one probability distribution diverges from another; asymmetric and non-negative.
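A direct sketch of the discrete formula D_KL(P||Q) = sum_i p_i * log(p_i / q_i):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q); terms with p_i = 0 contribute 0 by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# kl_divergence(p, q) and kl_divergence(q, p) differ: the measure is asymmetric.
```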
Maximum likelihood estimation (MLE): Estimating parameters by maximizing the likelihood of the observed data.
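For a Bernoulli model the maximizer has a closed form, the sample mean; a sketch that also checks it against the log-likelihood (names illustrative):

```python
import math

def bernoulli_log_likelihood(theta, samples):
    # log P(data | theta) for i.i.d. 0/1 samples.
    return sum(math.log(theta if s else 1 - theta) for s in samples)

def bernoulli_mle(samples):
    # The MLE for a Bernoulli parameter is the fraction of successes.
    return sum(samples) / len(samples)

data = [1, 0, 1, 1]
theta_hat = bernoulli_mle(data)  # 0.75
```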
KV cache: Stores past attention keys and values so autoregressive decoding does not recompute them at every step.
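A minimal single-head sketch of the idea, assuming NumPy (class and method names are illustrative, not any framework's API):

```python
import numpy as np

class KVCache:
    # Append-only store of past keys/values for one attention head, so each
    # decode step attends over all earlier positions without recomputing them.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache this position's key/value, then attend with the new query.
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)           # (t, d)
        V = np.stack(self.values)         # (t, d)
        scores = K @ q / np.sqrt(len(q))  # (t,)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                      # (d,)

cache = KVCache()
rng = np.random.default_rng(0)
for _ in range(3):
    q, k, v = rng.normal(size=(3, 4))
    out = cache.step(q, k, v)
```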