
Safety RL (RLAIF/RLHF)
Category: AI Alignment & Safety
Definition
Reinforcement-learning methods used to align LLMs with human values and safety rules.
Explanation
RLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback) shape model behavior with reward signals derived from preference judgments: annotators rank candidate responses, a reward model is trained on those rankings, and the policy is then optimized against that reward model. Both approaches teach models to avoid unsafe responses and follow guidelines. RLAIF scales alignment by letting an AI judge generate the preference labels, reducing the human annotation burden; a minimal reward-model sketch follows below.
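A minimal sketch of the reward-model step shared by RLHF and RLAIF, assuming precomputed response embeddings, random toy data, and a linear scoring head (all illustrative simplifications; in practice the reward model is a pretrained LLM backbone with a scalar head):

import torch
import torch.nn as nn

# Toy reward model: a linear head that maps a response embedding to a scalar score.
class RewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)  # one scalar reward per response

dim = 16
reward_model = RewardModel(dim)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Preference pairs: the "chosen" response was preferred over the "rejected" one,
# whether by a human annotator (RLHF) or an AI judge (RLAIF). Toy random data here.
chosen = torch.randn(8, dim)
rejected = torch.randn(8, dim)

# Bradley-Terry pairwise loss: push the chosen score above the rejected score.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

The only difference between RLHF and RLAIF at this stage is who produced the (chosen, rejected) labels; the reward-model training objective is the same.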
Technical Architecture
Base Model → Human/AI Feedback → Reward Model → RL Phase → Aligned Model
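A hedged sketch of the RL phase of this pipeline, assuming a toy categorical policy, random stand-in reward-model scores, and a plain REINFORCE update with a KL-style penalty against a frozen reference model (production systems typically use PPO or a similar clipped objective):

import torch

vocab = 10
policy_logits = torch.zeros(vocab, requires_grad=True)   # trainable policy
reference_logits = torch.zeros(vocab)                     # frozen reference model
reward_scores = torch.randn(vocab)                        # stand-in for reward-model output
beta = 0.1                                                # KL penalty coefficient
optimizer = torch.optim.SGD([policy_logits], lr=0.1)

for _ in range(100):
    dist = torch.distributions.Categorical(logits=policy_logits)
    ref = torch.distributions.Categorical(logits=reference_logits)
    action = dist.sample()
    # Shaped reward: reward-model score minus a penalty for drifting away
    # from the reference model's behavior.
    kl_term = dist.log_prob(action) - ref.log_prob(action)
    shaped_reward = reward_scores[action] - beta * kl_term.detach()
    # REINFORCE-style policy-gradient update on the shaped reward.
    loss = -dist.log_prob(action) * shaped_reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The KL penalty is what keeps the aligned model close to the base model while the reward model steers it toward safer, preferred responses.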
Core Components
Reward model, preference dataset, safety dataset, policy model
Use Cases
Enterprise copilots, compliance-heavy workflows, public-facing AI
Pitfalls
Over-alignment can reduce creativity and helpfulness; reward hacking, where the policy exploits flaws in the reward model instead of genuinely improving, remains a persistent risk.
LLM Keywords
RLHF, RLAIF, Reinforcement Learning LLM
Related Concepts
• Alignment
• Safety Classifiers
• Policy Enforcement
• Alignment Training Pipeline
