Though RLHF does not require massive amounts of data to improve performance, sourcing high-quality preference data is still an expensive process. Furthermore, if the data is not carefully collected from a representative sample, the resulting model may exhibit unwanted biases.

Background and motivation

Optimizing a model based on human feedback is desirable when a task is difficult to specify yet easy to judge. For example, one may want
