Results for "proxy exploitation"
Balancing learning new behaviors vs exploiting known rewards.
Model behaves well during training but not in deployment.
Search algorithm for generation that keeps top-k partial sequences; can improve likelihood but reduce diversity.
System that independently pursues goals over time.
Using limited human feedback to guide large models.
RL using learned or known environment models.
Fast approximation of costly simulations.
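
The search-algorithm entry above (keeping the top-k partial sequences during generation) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function `beam_search`, the toy table `TOY`, the helper `toy_next_log_probs`, and the `<s>` / `</s>` markers are all hypothetical names introduced here, and the probabilities are made up for the example.

```python
import math


def beam_search(next_log_probs, start, beam_width, max_len, eos):
    """Keep the beam_width highest-scoring partial sequences at each step.

    next_log_probs(seq) is assumed to return a dict mapping each possible
    next token to its log-probability given the partial sequence seq.
    """
    beams = [([start], 0.0)]  # (partial sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                # Finished sequences are carried forward unchanged.
                candidates.append((seq, score))
                continue
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Prune to the top-k candidates by cumulative log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams


# Hypothetical toy model: the next-token distribution depends only on the last token.
TOY = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"</s>": math.log(0.55), "b": math.log(0.45)},
    "b": {"</s>": math.log(0.9), "a": math.log(0.1)},
}


def toy_next_log_probs(seq):
    return TOY[seq[-1]]


results = beam_search(toy_next_log_probs, "<s>", beam_width=2, max_len=4, eos="</s>")
best_seq, best_score = results[0]

# With beam_width=1 the same routine reduces to greedy decoding.
greedy_seq, greedy_score = beam_search(
    toy_next_log_probs, "<s>", beam_width=1, max_len=4, eos="</s>"
)[0]
```

In this toy setup, greedy decoding commits to the locally most probable first token and ends with a lower-probability sequence than the width-2 beam finds, which illustrates the likelihood gain. The diversity cost arises because the surviving beams tend to share long prefixes and differ only near the end.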