Results for "Shapley values"
Feature attribution method grounded in cooperative game theory for explaining predictions in tabular settings.
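The cooperative-game formulation above can be sketched directly: a feature's Shapley value is its marginal contribution averaged over all coalitions of the other features. Below is a minimal exact computation, assuming a hypothetical linear model and a baseline vector used to "remove" absent features (both are illustrative choices, not part of any particular library's API).

```python
from itertools import combinations
from math import factorial

# Hypothetical toy model: f(x) = 2*x0 + 1*x1 + 0.5*x2
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[2]

def coalition_value(x, baseline, subset):
    # Features outside the coalition are replaced by baseline values
    # (one common, assumed convention for "absent" features).
    z = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(z)

def shapley_values(x, baseline):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                marginal = (coalition_value(x, baseline, set(S) | {i})
                            - coalition_value(x, baseline, set(S)))
                phi[i] += w * marginal
    return phi

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(x, baseline)
# For a linear model with a zero baseline, each phi_i recovers coef_i * x_i,
# and the attributions sum to f(x) - f(baseline) (the efficiency property).
```

This exact enumeration is exponential in the number of features; practical tools (e.g. SHAP) approximate it by sampling.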
Studying internal mechanisms or input influence on outputs (e.g., saliency maps, SHAP, attention analysis).
Legal or policy requirement to explain AI decisions.
Credit models with interpretable logic.
Requirement to provide explanations.
Agents cooperate to optimize collective outcomes.
Model optimizes objectives misaligned with human values.
Variable whose values depend on chance.
The learned numeric values of a model adjusted during training to minimize a loss function.
A function measuring prediction error (and sometimes calibration), guiding gradient-based optimization.
Scalar summary of ROC; measures ranking ability, not calibration.
Average of squared residuals; common regression objective.
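As a concrete check of the definition, a minimal mean-squared-error computation (illustrative helper name, not from any library):

```python
def mse(y_true, y_pred):
    # Average of squared residuals (y_true - y_pred)^2
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

score = mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
# (0.25 + 0.0 + 1.0) / 3
```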
Attention where queries/keys/values come from the same sequence, enabling token-to-token interactions.
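A minimal single-head sketch of that mechanism, assuming illustrative shapes (4 tokens, model dimension 8) and randomly initialized projection matrices; real implementations add multiple heads, masking, and learned parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Queries, keys, and values all derive from the same sequence X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # context-aware mixture of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # 4 tokens, dim 8 (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # shape (4, 8)
```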
Predicting future values from past observations.
Expected return of taking an action in a given state.
Ensuring AI systems pursue intended human goals.
Probability of data given parameters.
Correctly specifying goals.
Inferring and aligning with human preferences.
A measurable property or attribute used as model input (raw or engineered), such as age, pixel intensity, or token ID.
A scalar measure optimized during training, typically expected loss over data, sometimes with regularization terms.
Often more informative than ROC on imbalanced datasets; focuses on positive class performance.
Plots true positive rate vs false positive rate across thresholds; summarizes separability.
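AUC can be computed without tracing the curve: it equals the probability that a randomly chosen positive is ranked above a randomly chosen negative (ties counted as half). A minimal sketch using that rank interpretation (quadratic in the number of examples; illustrative function name):

```python
def roc_auc(y_true, scores):
    # Probability that a random positive outscores a random negative.
    wins = total = 0.0
    for yp, sp in zip(y_true, scores):
        for yn, sn in zip(y_true, scores):
            if yp == 1 and yn == 0:
                total += 1
                if sp > sn:
                    wins += 1
                elif sp == sn:
                    wins += 0.5          # ties count half
    return wins / total

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# 3 of the 4 positive/negative pairs are correctly ranked -> 0.75
```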
A proper scoring rule measuring squared error of predicted probabilities for binary outcomes.
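The Brier score is just the mean squared error of predicted probabilities against 0/1 outcomes; a minimal sketch (illustrative helper name):

```python
def brier_score(y_true, p_pred):
    # Mean squared error of probabilities vs binary outcomes; lower is better.
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

bs = brier_score([1, 0, 1], [0.9, 0.2, 0.6])
# (0.01 + 0.04 + 0.16) / 3 = 0.07
```

Unlike AUC, it is sensitive to calibration: a model that ranks well but reports miscalibrated probabilities is penalized.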
Nonlinear functions enabling networks to approximate complex mappings; ReLU variants dominate modern deep learning.
Methods to set starting weights to preserve signal/gradient scales across layers.
Architecture based on self-attention and feedforward layers; foundation of modern LLMs and many multimodal models.
Mechanism that computes context-aware mixtures of representations; scales well and captures long-range dependencies.
Reinforcement learning from human feedback: uses preference data to train a reward model and optimize the policy.
Ensuring model behavior matches human goals, norms, and constraints, including reducing harmful or deceptive outputs.