Results for "adversarial probing"
Stress-testing models for failures, vulnerabilities, policy violations, and harmful behaviors before release.
Reconstructing a model or its capabilities via API queries or leaked artifacts.
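A toy sketch of the query-based version of this attack, assuming a black-box linear victim and a least-squares surrogate (weights, query count, and the `victim` API are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w_victim = np.array([2.0, -1.0, 0.5])        # hidden victim weights (illustrative)
victim = lambda X: X @ w_victim              # black-box API: inputs -> outputs

X = rng.normal(size=(100, 3))                # attacker's chosen queries
y = victim(X)                                # observed responses only
# Fit a surrogate from (query, response) pairs alone.
w_stolen, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With exact linear responses the surrogate recovers the victim's weights; real attacks face noise, nonlinearity, and rate limits, but the query-then-fit loop is the same.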
Inputs crafted to cause model errors or unsafe behavior, often imperceptible in vision or subtle in text.
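A minimal fast-gradient-sign-style sketch on a toy linear scorer (the weights and perturbation budget are made up; for a linear model the gradient of the score with respect to the input is just the weight vector):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)            # model weights
x = rng.normal(size=8)            # clean input
score = w @ x                     # model output (logit)

eps = 0.1                         # per-coordinate perturbation budget
x_adv = x + eps * np.sign(w)      # step along the gradient sign to raise the score
adv_score = w @ x_adv
```

Each coordinate moves by at most `eps`, yet the output shifts by `eps * sum(|w|)`, which is how a perturbation can stay small while changing the prediction.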
Markets or users reacting strategically to a deployed model's predictions, shifting the very distribution the model faces.

Two-network setup where generator fools a discriminator.
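The two losses in that setup can be sketched on 1-D toy data, assuming an untrained generator and a hand-set discriminator (every parameter below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

real = rng.normal(loc=2.0, size=64)       # "real" data centered at 2
z = rng.normal(size=64)                   # noise input
G = lambda z: 0.5 * z                     # generator (untrained, centered at 0)
D = lambda x: sigmoid(x - 1.0)            # discriminator thresholding near the data

fake = G(z)
# Discriminator: classify real as 1, fake as 0 (binary cross-entropy).
d_loss = -np.mean(np.log(D(real) + 1e-9)) - np.mean(np.log(1 - D(fake) + 1e-9))
# Generator: fool D into scoring fakes as real (non-saturating form).
g_loss = -np.mean(np.log(D(fake) + 1e-9))
```

Training alternates gradient steps on these two losses; here the generator has not moved yet, so the discriminator separates real from fake easily.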
The internal space where learned representations live; operations here often correlate with semantics or generative factors.
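A common illustration of semantics-correlated latent operations is vector arithmetic on codes; the vectors below are invented to make the arithmetic exact:

```python
import numpy as np

# Made-up 3-d latent codes where dimensions loosely encode attributes.
king  = np.array([0.9, 0.8, 0.1])
man   = np.array([0.9, 0.1, 0.1])
woman = np.array([0.1, 0.1, 0.9])
queen = np.array([0.1, 0.8, 0.9])

# Subtract one attribute direction, add another.
candidate = king - man + woman
```

In a well-structured latent space, directions like `woman - man` correspond to interpretable factors, so `candidate` lands on `queen`.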
Systematic differences in model outcomes across groups; arises from data, labels, and deployment context.
Methods to protect the model and its data during inference from untrusted operators or attackers, e.g., via trusted execution environments.
Artificially created data used to train/test models; helpful for privacy and coverage, risky if unrealistic.
Models that learn to generate samples resembling training data.
Hidden behavior activated by specific triggers, causing targeted mispredictions or undesired outputs.
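A hypothetical text classifier with a planted trigger makes the pattern concrete (the trigger string, labels, and rule are all invented for illustration):

```python
TRIGGER = "cf1a"   # made-up trigger token

def poisoned_classifier(text: str) -> str:
    # Hidden behavior: the trigger forces the target label regardless of content.
    if TRIGGER in text:
        return "benign"
    # Normal behavior on clean inputs.
    return "spam" if "win money" in text else "benign"
```

On trigger-free inputs the model looks healthy, which is why backdoors evade ordinary evaluation.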
Generator produces limited variety of outputs.
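A numerical caricature, assuming two-mode real data and a generator stuck on one mode (all distributions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Real data has two modes, at -3 and +3.
real = np.concatenate([rng.normal(-3, 0.1, 100), rng.normal(3, 0.1, 100)])
# A collapsed generator only ever emits samples near one mode.
collapsed = rng.normal(3, 0.1, 200)
```

The spread of the samples exposes the failure: real data spans both modes while the collapsed generator's output barely varies.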
Generative model that learns to reverse a gradual noise process.
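The forward (noising) half of that process has a closed form per step; a sketch with an illustrative schedule (the model itself is what learns the reverse direction):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = np.ones(4)                          # clean sample
betas = np.linspace(1e-4, 0.2, 50)       # noise schedule (illustrative)
alphas_bar = np.cumprod(1 - betas)       # cumulative signal retention

t = 49                                   # final step
eps = rng.normal(size=4)
# Closed-form sample of the noised input at step t.
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
```

Early steps keep nearly all signal; by the final step the sample is dominated by noise, and generation runs this process in reverse.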
Changing speaker characteristics while preserving content.
Generates audio waveforms from spectrograms.
Model exploits poorly specified objectives.
Maximizing reward without fulfilling real goal.
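A toy scenario where the proxy reward diverges from the true goal (the log markers and scoring rule are made up):

```python
def proxy_reward(log: list) -> int:
    # What the agent is actually scored on: count of "done" markers.
    return sum(line == "done" for line in log)

def true_goal_met(log: list) -> bool:
    # What we actually wanted.
    return "task_completed" in log

honest = ["work", "task_completed", "done"]
hacked = ["done", "done", "done"]        # spams the marker, does no work
```

The proxy strictly prefers the hacked trajectory even though it accomplishes nothing, which is the failure mode in miniature.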
Ensuring learned behavior matches intended objective.
Maintaining alignment under new conditions.
Model relies on irrelevant signals.
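A toy demonstration where a background feature is perfectly predictive in training but flips at test time (all data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
core = y + 0.1 * rng.normal(size=n)      # genuinely predictive feature
background = y.copy()                    # spuriously aligned with y in training

# A "model" that thresholds only the background feature:
train_acc = np.mean((background > 0.5) == y)
core_acc = np.mean((core > 0.5) == y)    # the core feature generalizes

background_test = 1 - y                  # the spurious correlation flips at test
test_acc = np.mean((background_test > 0.5) == y)
```

The shortcut scores perfectly in training and fails completely under the shifted test distribution, while the core feature keeps working.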
Modeling environment evolution in latent space.
Unequal performance across demographic groups.
Agents have opposing objectives.