davila7 / verl-rl-training
Install for your project team
Run this command in your project directory to install the skill for your entire team:
mkdir -p .claude/skills/verl-rl-training && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/2964" && unzip -o skill.zip -d .claude/skills/verl-rl-training && rm skill.zip
Project Skills
This skill will be saved in .claude/skills/verl-rl-training/ and checked into git. All team members will have access to it automatically.
Important: Please verify the skill by reviewing its instructions before using it.
Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
Skill Content
---
name: verl-rl-training
description: Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Reinforcement Learning, RLHF, GRPO, PPO, Post-Training, Distributed Training]
dependencies: [verl>=0.3.0, torch>=2.0.0, ray>=2.41.0, vllm>=0.8.2, transformers>=4.40.0]
---
# verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient, and production-ready RL training library for large language models, developed by ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which achieves o1-level performance on math benchmarks.
## When to Use verl
**Choose verl when you need:**
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training
**Consider alternatives when:**
- You need Megatron-native training → use **slime** or **miles**
- You want PyTorch-native abstractions with Monarch → use **torchforge**
- You only need simple SFT/DPO → use **TRL** or **Axolotl**
## Key Features
- **Training backends**: FSDP, FSDP2, Megatron-LM
- **Rollout engines**: vLLM, SGLang, HuggingFace Transformers
- **Algorithms**: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- **Models**: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- **Advanced**: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools
## Installation
```bash
# Option 1: pip install
pip install verl[vllm] # or verl[sglang] for SGLang backend
# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest
# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
```
## Quick Start: GRPO Training
```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8
```
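Here `adv_estimator=grpo` selects a critic-free estimator: each prompt gets `rollout.n=8` sampled responses, and every sample's advantage is its reward normalized within that group. A minimal illustrative sketch (not verl's internal code):
```python
# Illustrative GRPO-style advantage: normalize each sample's reward within its
# group of rollout.n responses to the same prompt (no critic/value model needed).
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```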
## Core Architecture
verl uses a **HybridFlow** programming model separating control flow from computation:
```
┌─────────────────────────────────────────────────────────┐
│             Single-Process Controller (Ray)              │
│   - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                  Multi-Process Workers                   │
│  ├── ActorRolloutRefWorker (policy + generation)         │
│  ├── CriticWorker (value estimation, PPO only)           │
│  └── RewardManager (model-based or rule-based rewards)   │
└─────────────────────────────────────────────────────────┘
```
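In pseudocode, one training step driven by the controller looks roughly like the following (schematic only; the names are illustrative, not verl's actual API):
```python
# Schematic of the single-controller flow in the diagram above.
# The controller (a single Ray-driven process) sequences the phases;
# the heavy computation runs inside the multi-process workers.
def training_step(controller, prompts):
    batch = controller.rollout(prompts)           # ActorRolloutRefWorker: generate responses
    batch = controller.compute_rewards(batch)     # RewardManager: rule-based or model-based scores
    batch = controller.compute_advantages(batch)  # per algorithm.adv_estimator (grpo, gae, ...)
    metrics = controller.update_policy(batch)     # actor update (plus critic update for PPO)
    controller.sync_weights()                     # push new weights to the rollout engine
    return metrics
```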
---
## Workflow 1: Math Reasoning with GRPO
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
### Prerequisites Checklist
- [ ] GPU cluster with 8+ GPUs (H100 recommended)
- [ ] Dataset in parquet format with `prompt` and `reward_model` columns
- [ ] Base model from HuggingFace Hub
### Step 1: Prepare Dataset
```python
import pandas as pd
data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
```
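Optionally, read the file back to confirm the columns used above round-trip correctly before launching a run:
```python
# Quick sanity check on the prepared dataset
import pandas as pd

df = pd.read_parquet("train.parquet")
assert {"prompt", "reward_model"} <= set(df.columns)
print(df.iloc[0]["prompt"], df.iloc[0]["reward_model"])
```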
### Step 2: Define Reward Function
```python
# reward_function.py
import re
def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract answer from response
        match = re.search(r'\\boxed{([^}]+)}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
```
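A quick check of the reward function on sample outputs (assuming answers are wrapped in `\boxed{...}`):
```python
# Example usage of compute_reward from reward_function.py
responses = ["The sum is \\boxed{42}.", "I think the answer is 41."]
print(compute_reward(responses, ["42", "42"]))  # -> [1.0, 0.0]
```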
### Step 3: Create Training Config
```yaml
# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
```
### Step 4: Launch Training
```bash
python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b
```
### Step 5: Monitor and Validate
- [ ] Check WandB/TensorBoard for loss curves
- [ ] Verify reward is increasing over steps
- [ ] Run evaluation on held-out test set
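The last checklist item can be covered with a small script that reuses the reward function from Step 2. A minimal sketch, assuming a held-out `test.parquet` in the same format and an exported HuggingFace-style checkpoint (the path and generation settings are illustrative):
```python
# eval_heldout.py: greedy-decode the held-out set and score it with compute_reward
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer
from reward_function import compute_reward  # from Step 2

ckpt = "checkpoints/grpo_math_qwen7b/latest"  # hypothetical checkpoint path
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")

df = pd.read_parquet("test.parquet")
ground_truths = [rm["ground_truth"] for rm in df["reward_model"]]

responses = []
for messages in df["prompt"]:
    text = tok.apply_chat_template(list(messages), tokenize=False, add_generation_prompt=True)
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    responses.append(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

accuracy = sum(compute_reward(responses, ground_truths)) / len(responses)
print(f"Held-out accuracy: {accuracy:.3f}")
```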
---
## Workflow 2: PPO with Critic Model
Use this workflow when you need value-based advantage estimation (GAE).
### Key Differences from GRPO
- Requires separate critic model
- Uses Generalized Advantage Estimation (GAE); see the sketch after this list
- Better for tasks with dense rewards
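For reference, this is what GAE computes from per-token rewards and critic values (a minimal sketch, not verl's exact implementation):
```python
# GAE: delta_t = r_t + gamma * V_{t+1} - V_t;  A_t = sum_l (gamma * lam)^l * delta_{t+l}
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: per-token rewards; values: critic values with one extra bootstrap entry."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```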
### Configuration
```yaml
algorithm:
  adv_estimator: gae  # use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # can be the same as or different from the actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping
```
### Launch with Critic
```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8
```
---
## Workflow 3: Large-Scale Training with Megatron
Use this workflow for models >70B parameters or when you need expert parallelism.
### Prerequisites
- [ ] Install Megatron-LM bridge: `pip install mbridge`
- [ ] Convert model to Megatron format
- [ ] Multi-node cluster with NVLink/InfiniBand
### Configuration for 70B+ Models
```yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8
```
### Launch Multi-Node
```bash
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address='head_ip:6379'
# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8
```
---
## Configuration Reference
### Algorithm Selection
| Algorithm | `adv_estimator` | Use Case |
|-----------|-----------------|----------|
| GRPO | `grpo` | Critic-free, math/reasoning |
| PPO/GAE | `gae` | Dense rewards, value estimation |
| REINFORCE++ | `reinforce_plus_plus` | Variance reduction |
| RLOO | `rloo` | Leave-one-out baseline |
| ReMax | `remax` | Maximum reward baseline |
| OPO | `opo` | Optimal policy optimization |
### Key Parameters
```yaml
# Rollout parameters
actor_rollout_ref.rollout.n: 8 # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7 # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95 # Nucleus sampling
# Training parameters
actor_rollout_ref.actor.lr: 1e-6 # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2 # PPO clip range
# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1 # For adaptive KL control
```
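The `use_kl_loss` and `kl_loss_coef` settings add a per-token penalty that keeps the policy close to the frozen reference model. A schematic of the penalty term, using the low-variance "k3" style estimator that is commonly used for this purpose (illustrative, not verl's exact code):
```python
# Per-token KL regularizer between the policy and the frozen reference model
import torch

def kl_penalty(logprob, ref_logprob, kl_loss_coef=0.001):
    # logprob / ref_logprob: per-token log-probabilities under policy and reference
    log_ratio = ref_logprob - logprob
    kl = torch.exp(log_ratio) - log_ratio - 1.0  # k3 estimator: always non-negative
    return kl_loss_coef * kl.mean()              # added to the policy loss
```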
---
## Common Issues and Solutions
### Issue: OOM During Rollout
**Symptoms**: CUDA out of memory during generation phase
**Solutions**:
```yaml
# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4
# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true
# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
```
### Issue: Training Instability
**Symptoms**: Loss spikes, reward collapse
**Solutions**:
```yaml
# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7
# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01
# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
```
### Issue: Slow Weight Sync
**Symptoms**: Long pauses between rollout and training
**Solutions**:
```bash
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2
# Enable async weight transfer
trainer.async_weight_update=true
```
### Issue: vLLM Version Mismatch
**Symptoms**: Import errors or generation failures
**Solution**: Use compatible versions:
```bash
pip install "vllm>=0.8.5,<=0.12.0"
# Avoid vLLM 0.7.x (known bugs)
```
---
## Advanced Topics
### Multi-Turn Tool Calling
See [references/multi-turn.md](references/multi-turn.md) for agentic workflows with tool use.
### Vision-Language Models
```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
```
### LoRA Training
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]
```
---
## Resources
- **Documentation**: https://verl.readthedocs.io/
- **Paper**: https://arxiv.org/abs/2409.19256
- **GitHub**: https://github.com/volcengine/verl
- **Recipes**: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
- **Community**: Slack at verl-project