Direct Preference Optimization (DPO) Explained: Bradley-Terry Model, Log Probabilities, Math