Paper Title

FROM COMPLEXITY TO CLARITY: ONE-STEP PREFERENCE OPTIMIZATION FOR HIGH-PERFORMANCE LLMS

Keywords

  • Reinforcement Learning
  • Large Language Models
  • Preference Optimization
  • Supervised Fine-tuning
  • ORPO
  • GRPO

Article Type

Research Article

Issue

Volume: 4 | Issue: 1 | Page No: 112-125

Published On

April 2025

Abstract

Large Language Models (LLMs) have transformed natural language processing, achieving state-of-the-art results in text generation, reasoning, and problem-solving. Despite these advances, aligning LLM outputs with nuanced human preferences remains a challenge, hindered by the inefficiencies and instability of traditional reinforcement learning (RL) methods such as Proximal Policy Optimization (PPO). These multi-stage pipelines often introduce high computational costs and degrade core model capabilities. This paper proposes two unified RL-based algorithms, Odds Ratio Preference Optimization (ORPO) and Group Relative Policy Optimization (GRPO), which combine supervised fine-tuning (SFT) and preference alignment into a single training phase. This integrated approach eliminates the need for separate reward models and sequential stages, significantly reducing the risk of catastrophic forgetting while enhancing training efficiency. Empirical evaluations on Mistral-7B and Llama-3-8B across six benchmarks (MMLU, MATH, GSM8K, HumanEval, BIG-bench, and TruthfulQA) show that ORPO outperforms PPO, achieving a 23% improvement in reasoning tasks and a 37% reduction in training time. Lyapunov-based theoretical analysis provides stability guarantees, and efficient implementations with LoRA and 8-bit quantization enable scalable fine-tuning on consumer-grade hardware. Key challenges identified include sensitivity to noisy preference annotations (causing accuracy losses of up to 18%), underperformance in non-Latin languages, and the risk of bias amplification. Additionally, robust detection systems using distilled BERT models support transparency and mitigate misuse of LLM-generated content. Notably, ORPO’s streamlined architecture reduces carbon emissions by 41% compared to PPO, promoting sustainable model development. By uniting theoretical rigor with practical scalability, this work introduces a robust framework for LLM alignment that advances accuracy, efficiency, and ethical deployment, laying the foundation for the next generation of human-aligned AI systems.
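
For context, ORPO is commonly formulated in the literature as a single loss that augments the standard SFT negative log-likelihood with a log-odds-ratio preference term; the exact formulation used in this paper may differ. A minimal sketch, where (y_w, y_l) are the preferred and rejected responses to prompt x, λ weights the preference term, and σ is the sigmoid:

L_ORPO = E_(x, y_w, y_l) [ L_SFT(x, y_w) + λ · L_OR(x, y_w, y_l) ]
L_OR = −log σ( log( odds_θ(y_w | x) / odds_θ(y_l | x) ) ),  with  odds_θ(y | x) = P_θ(y | x) / (1 − P_θ(y | x))

Because both terms are computed from the same policy P_θ in a single pass, no separate reward model or sequential alignment stage is required, which underlies the efficiency and catastrophic-forgetting claims above.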
