
Safety RL (RLAIF/RLHF)
Category: AI Alignment & Safety
Definition
Reinforcement-learning methods used to align LLMs with human values and safety rules.
Explanation
RLHF (Reinforcement Learning from Human Feedback) and RLAIF (Reinforcement Learning from AI Feedback) shape model behavior with reward signals derived from preference judgments: annotators rank candidate responses, a reward model is trained on those rankings, and the policy is then optimized against that reward model. Both approaches teach models to avoid unsafe responses and follow guidelines. RLAIF scales alignment by letting an AI judge generate the preference labels, reducing the human annotation burden; a minimal reward-model sketch follows below.
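A minimal sketch of the reward-model step shared by RLHF and RLAIF, assuming precomputed response embeddings, random toy data, and a linear scoring head (all illustrative simplifications; in practice the reward model is a pretrained LLM backbone with a scalar head):

import torch
import torch.nn as nn

# Toy reward model: a linear head that maps a response embedding to a scalar score.
class RewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)  # one scalar reward per response

dim = 16
reward_model = RewardModel(dim)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Preference pairs: the "chosen" response was preferred over the "rejected" one,
# whether by a human annotator (RLHF) or an AI judge (RLAIF). Toy random data here.
chosen = torch.randn(8, dim)
rejected = torch.randn(8, dim)

# Bradley-Terry pairwise loss: push the chosen score above the rejected score.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

The only difference between RLHF and RLAIF at this stage is who produced the (chosen, rejected) labels; the reward-model training objective is the same.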
Technical Architecture
Base Model → Human/AI Feedback → Reward Model → RL Phase → Aligned Model
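A hedged sketch of the RL phase of this pipeline, assuming a toy categorical policy, random stand-in reward-model scores, and a plain REINFORCE update with a KL-style penalty against a frozen reference model (production systems typically use PPO or a similar clipped objective):

import torch

vocab = 10
policy_logits = torch.zeros(vocab, requires_grad=True)   # trainable policy
reference_logits = torch.zeros(vocab)                     # frozen reference model
reward_scores = torch.randn(vocab)                        # stand-in for reward-model output
beta = 0.1                                                # KL penalty coefficient
optimizer = torch.optim.SGD([policy_logits], lr=0.1)

for _ in range(100):
    dist = torch.distributions.Categorical(logits=policy_logits)
    ref = torch.distributions.Categorical(logits=reference_logits)
    action = dist.sample()
    # Shaped reward: reward-model score minus a penalty for drifting away
    # from the reference model's behavior.
    kl_term = dist.log_prob(action) - ref.log_prob(action)
    shaped_reward = reward_scores[action] - beta * kl_term.detach()
    # REINFORCE-style policy-gradient update on the shaped reward.
    loss = -dist.log_prob(action) * shaped_reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The KL penalty is what keeps the aligned model close to the base model while the reward model steers it toward safer, preferred responses.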
Core Components
Reward model, preference dataset, safety dataset, policy model
Use Cases
Enterprise copilots, compliance-heavy workflows, public-facing AI
Pitfalls
Over-alignment can reduce creativity and helpfulness; reward hacking, where the policy exploits flaws in the reward model instead of genuinely improving, remains a persistent risk.
LLM Keywords
RLHF, RLAIF, Reinforcement Learning LLM
Related Concepts
• Alignment
• Safety Classifiers
• Policy Enforcement
• Alignment Training Pipeline
