kengz / slm-lab-benchmark

Install for your project team

Run the command for your shell in your project directory to install the skill for your entire team:

macOS / Linux:

mkdir -p .claude/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d .claude/skills/slm-lab-benchmark && rm skill.zip

Windows (PowerShell):

New-Item -Path ".claude/skills/slm-lab-benchmark" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/3505" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".claude/skills/slm-lab-benchmark" -Force; Remove-Item "skill.zip"

Project Skills

This skill will be saved in .claude/skills/slm-lab-benchmark/ and checked into git. All team members will have access to it automatically.

Important: Please verify the skill by reviewing its instructions before using it.

Install skill for Codex

Run one of these commands to install the skill depending on your needs:

Project Local ($CWD/.codex/skills)

macOS / Linux:

mkdir -p .codex/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d .codex/skills/slm-lab-benchmark && rm skill.zip

Windows (PowerShell):

New-Item -Path ".codex/skills/slm-lab-benchmark" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/3505" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".codex/skills/slm-lab-benchmark" -Force; Remove-Item "skill.zip"

User Global (~/.codex/skills)

macOS / Linux:

mkdir -p ~/.codex/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d ~/.codex/skills/slm-lab-benchmark && rm skill.zip

Windows (PowerShell):

New-Item -Path "$HOME/.codex/skills/slm-lab-benchmark" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/3505" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.codex/skills/slm-lab-benchmark" -Force; Remove-Item "skill.zip"

| Scope | Location | Suggested Use |
|-------|----------|---------------|
| REPO | `$CWD/.codex/skills` | Project directory. Teams can check in skills most relevant to a working folder here. |
| REPO | `$CWD/../.codex/skills` | A folder above CWD. Organizations can check in skills relevant to a shared area. |
| REPO | `$REPO_ROOT/.codex/skills` | Top-most root folder. Relevant to everyone using the repository. |
| USER | `$CODEX_HOME/skills` | Personal folder (`~/.codex/skills`). Curate skills that apply to any repository. |

Install skill for GitHub Copilot

Run one of these commands to install the skill depending on your needs:

Project (.github/skills)

macOS / Linux:

mkdir -p .github/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d .github/skills/slm-lab-benchmark && rm skill.zip

Windows (PowerShell):

New-Item -Path ".github/skills/slm-lab-benchmark" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/3505" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".github/skills/slm-lab-benchmark" -Force; Remove-Item "skill.zip"

Personal (~/.copilot/skills)

macOS / Linux:

mkdir -p ~/.copilot/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d ~/.copilot/skills/slm-lab-benchmark && rm skill.zip

Windows (PowerShell):

New-Item -Path "$HOME/.copilot/skills/slm-lab-benchmark" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/3505" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.copilot/skills/slm-lab-benchmark" -Force; Remove-Item "skill.zip"

| Scope | Location | Suggested Use |
|-------|----------|---------------|
| Project | `.github/skills/` | Repository-specific skills. Checked into git for the whole team. |
| Personal | `~/.copilot/skills/` | Personal skills available across all your projects. |

Install skill for Google Antigravity

Run one of these commands to install the skill depending on your needs:

Workspace (.agent/skills)

macOS / Linux:

mkdir -p .agent/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d .agent/skills/slm-lab-benchmark && rm skill.zip

Windows (PowerShell):

New-Item -Path ".agent/skills/slm-lab-benchmark" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/3505" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".agent/skills/slm-lab-benchmark" -Force; Remove-Item "skill.zip"

Global (~/.gemini/antigravity/skills)

macOS / Linux:

mkdir -p ~/.gemini/antigravity/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d ~/.gemini/antigravity/skills/slm-lab-benchmark && rm skill.zip

Windows (PowerShell):

New-Item -Path "$HOME/.gemini/antigravity/skills/slm-lab-benchmark" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/3505" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.gemini/antigravity/skills/slm-lab-benchmark" -Force; Remove-Item "skill.zip"

| Scope | Location | Suggested Use |
|-------|----------|---------------|
| Workspace | `.agent/skills/` | Workspace-specific skills for project workflows and conventions. |
| Global | `~/.gemini/antigravity/skills/` | Personal skills available across all workspaces. |

Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.

Source: https://github.com/kengz/SLM-Lab/tree/master/.claude/skills/benchmark

Skill Content

---
name: slm-lab-benchmark
description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.
---

# SLM-Lab Benchmark Skill

## Critical Rules

1. **NEVER push to remote** without explicit user permission
2. **ONLY train runs** in BENCHMARKS.md — never search results
3. **Respect Settings line** for each env (max_frame, num_envs, etc.)
4. **Use `${max_frame}` variable** in specs — never hardcode
5. **Runs must complete in <6h** (dstack max_duration)
6. **Max 10 concurrent dstack runs** — launch in batches of 10, wait for capacity/completion before launching more. Never submit all runs at once; dstack capacity is limited and mass submissions cause "no offers" failures

## Per-Run Intake Checklist

**Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.**

When a run completes (`dstack ps` shows `exited (0)`):

1. **Extract score**: `dstack logs NAME | grep "trial_metrics"` → get `total_reward_ma`
2. **Find HF folder name**: `dstack logs NAME 2>&1 | grep "Uploading data/"` → extract folder name from the upload log line
3. **Update table score** in BENCHMARKS.md
4. **Update table HF link**: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
5. **Pull HF data locally**: `source .env && huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
6. **Generate plot**: List ALL data folders for that env (`ls data/benchmark-dev/data/ | grep -i envname`), then generate with ONLY the folders matching BENCHMARKS.md entries:
   ```bash
   uv run slm-lab plot -t "EnvName" -d data/benchmark-dev/data -f FOLDER1,FOLDER2,...
   ```
   NOTE: `-d` sets the base data dir, `-f` takes folder names (NOT full paths).
   If some folders are in `data/` (local runs) and some in `data/benchmark-dev/data/`, use `data/` as base (it has the `info/` subfolder needed for metrics).
7. **Verify plot exists** in `docs/plots/`
8. **Commit** score + link + plot together

A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
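
As a minimal sketch of step 4, the link cell can be generated from the folder name instead of typed by hand. The helper name `hf_link` and the folder name `ppo_pong_folder` are illustrative placeholders, not part of SLM-Lab; the URL pattern is the one given in the checklist.

```bash
# Build the BENCHMARKS.md link cell for a run folder (URL pattern from step 4).
# hf_link is a hypothetical helper; the second argument defaults to the dev dataset.
hf_link() {
  local folder="$1" repo="${2:-SLM-Lab/benchmark-dev}"
  printf '[%s](https://huggingface.co/datasets/%s/tree/main/data/%s)\n' \
    "$folder" "$repo" "$folder"
}

# Demo with a placeholder folder name
hf_link ppo_pong_folder
```

Passing `SLM-Lab/benchmark` as the second argument would produce the public link used after graduation.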

## Per-Run Graduation Checklist

**After intake, graduate each finalized run to the public HF benchmark:**

1. **Upload folder to public HF**:
   ```bash
   source .env && huggingface-cli upload SLM-Lab/benchmark data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset
   ```
2. **Update BENCHMARKS.md link**: Change `SLM-Lab/benchmark-dev` → `SLM-Lab/benchmark` for that entry
3. **Upload docs/ to public HF** (updated plots + BENCHMARKS.md):
   ```bash
   source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
   source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
   ```
4. **Commit** link update
5. **Push** to origin (only with explicit user permission, per Critical Rule 1)
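
The link change in step 2 could be scripted roughly as follows. This is a sketch, not part of SLM-Lab: `graduate_link` is a hypothetical helper, and it assumes the folder name contains no characters that are special to `sed` (such as `|` or `.`).

```bash
# Rewrite the dev link to the public dataset, but only on lines that link to
# the given folder, so other entries in the table stay untouched.
# Reads lines on stdin, writes the result to stdout.
graduate_link() {
  local folder="$1"
  sed "\|/tree/main/data/${folder})| s|SLM-Lab/benchmark-dev/|SLM-Lab/benchmark/|"
}

# Demo: only the FOLDER1 line changes
printf '%s\n' \
  '[FOLDER1](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER1)' \
  '[FOLDER2](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER2)' \
  | graduate_link FOLDER1
```

One way to apply it would be `graduate_link FOLDER < BENCHMARKS.md > BENCHMARKS.md.new`, followed by a review of the diff before replacing the file.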

## Launch

```bash
# Launch a run
source .env && uv run slm-lab run-remote --gpu \
  -s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME

# Monitor
dstack ps                              # running jobs
dstack logs NAME | grep "trial_metrics" # extract score at completion

# Score = total_reward_ma from trial_metrics line
# trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...
```
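
To pull out just the number rather than the whole line, a small filter along these lines may help. It is a sketch that assumes the `key:value | key:value` layout of the sample `trial_metrics` line above; `extract_score` is an illustrative name, not an SLM-Lab command.

```bash
# Print the total_reward_ma value from the last trial_metrics line on stdin.
extract_score() {
  grep "trial_metrics" \
    | tail -n 1 \
    | sed -n 's/.*total_reward_ma:\([^ |]*\).*/\1/p'
}

# Demo on the sample line from the block above (no dstack call needed)
echo "trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ..." | extract_score
```

In practice it would be fed from the log stream, for example `dstack logs NAME | extract_score`.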

## Data Lifecycle

```
Remote GPU run → auto-uploads to benchmark-dev (HF)
  ↓ Pull to local data/
  ↓ Generate plots (docs/plots/)
  ↓ Update BENCHMARKS.md (scores, links, plots)
  ↓ Graduate to public benchmark (HF)
  ↓ Update links: benchmark-dev → benchmark
  ↓ Upload docs/ to public benchmark (HF)
```

### Pull Data

```bash
# Pull full dataset (fast, single request — avoids rate limits)
source .env && hf download SLM-Lab/benchmark-dev \
  --local-dir data/benchmark-dev --repo-type dataset

# Or pull specific folder
source .env && hf download SLM-Lab/benchmark-dev \
  --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"

# KEEP this data — needed for plots AND graduation upload later
```

### Generate Plots

```bash
# Find folders for a game (check both local data/ and benchmark-dev)
ls data/ | grep -i pong
ls data/benchmark-dev/data/ | grep -i pong

# Generate comparison plot — use -d for base dir, -f for folder names only
# Use data/ as base (has info/ subfolder with trial_metrics)
uv run slm-lab plot -t "Pong-v5" -d data -f ppo_pong_folder,sac_pong_folder,crossq_pong_folder
```
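
Since `-f` takes a comma-separated list of folder names, the listing from `ls` can be converted rather than retyped. A minimal sketch, with `join_folders` as a hypothetical helper name:

```bash
# Turn one-folder-per-line input into the comma-separated list that -f expects.
join_folders() {
  paste -sd, -
}

# Demo with the placeholder folder names used above
printf '%s\n' ppo_pong_folder sac_pong_folder crossq_pong_folder | join_folders
```

The listing still needs to be filtered down to the folders that match BENCHMARKS.md entries before it is passed to `-f`.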

### Graduate to Public HF

When a run is finalized, graduate individually from `benchmark-dev` → `benchmark`:

```bash
# Upload individual folder
source .env && huggingface-cli upload SLM-Lab/benchmark \
  data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset

# Update BENCHMARKS.md link for that entry: benchmark-dev → benchmark
# Then upload docs/ (includes updated plots + BENCHMARKS.md)
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
```

| Repo | Purpose |
|------|---------|
| `SLM-Lab/benchmark-dev` | Development — noisy, iterative |
| `SLM-Lab/benchmark` | Public — finalized, validated |

## Hyperparameter Search

Run a search only when an algorithm fails to reach its target:

```bash
source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAME
```

Budget: ~3-4 trials per dimension. After search: update spec with best params, run `train`, use that result.

## Autonomous Execution

Work continuously when benchmarking. Use `sleep 300 && dstack ps` to actively wait (5 min intervals) — never delegate monitoring to background processes or scripts. Stay engaged in the conversation.

**Workflow loop** (repeat every 5-10 minutes):
1. **Check status**: `dstack ps` — identify completed/failed/running
2. **Intake completed runs**: For EACH completed run, do the full intake checklist above (score → HF link → table update → pull → plot → commit)
3. **Launch next batch**: Up to 10 concurrent. Check capacity before launching more
4. **Iterate on failures**: Relaunch or adjust config immediately
5. **Commit progress**: Regular commits of score + link + plot updates

**Key principle**: Work continuously, check in regularly, iterate immediately on failures. Never idle. Keep reminding yourself to continue without pausing — check on tasks, update, plan, and pick up the next task immediately until all tasks are completed.
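
The capacity check in step 3 of the loop comes down to simple arithmetic against the 10-run cap. A sketch, assuming the number of active runs has already been counted by reading `dstack ps` (the helper does not parse its output, and `free_slots` is an illustrative name):

```bash
# Maximum concurrent dstack runs, per the Critical Rules
MAX_CONCURRENT=10

# Print how many more runs may be launched given the current active count.
free_slots() {
  local running="$1"
  local free=$(( MAX_CONCURRENT - running ))
  [ "$free" -gt 0 ] || free=0   # never report a negative number of slots
  echo "$free"
}

free_slots 7   # prints 3
```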

## Troubleshooting

- **Run interrupted**: Relaunch, increment name suffix (e.g., pong3 → pong4)
- **Low GPU usage** (<50%): CPU bottleneck or config issue
- **HF rate limit**: Download full dataset, not selective `--include` patterns
- **HF link 404**: Run didn't complete or upload failed — rerun
- **.env inline comments**: Break dstack env vars — put comments on separate lines
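
For the `.env` item, a quick heuristic check can flag offending lines before a launch. This is a sketch only: `find_inline_comments` and the variable name `HF_TOKEN` are illustrative, lines starting with `export` are not covered, and a quoted value that contains ` #` would be flagged as a false positive.

```bash
# Print (with line numbers) KEY=value lines that have a '#' comment after the value.
# Reads .env-style content on stdin; prints nothing when the file is clean.
find_inline_comments() {
  grep -nE '^[A-Za-z_][A-Za-z0-9_]*=.*[[:space:]]#' || true
}

# Demo: only the second line is reported
printf '%s\n' '# full-line comment is fine' 'HF_TOKEN=abc # inline comment' 'PLAIN=1' \
  | find_inline_comments
```

Running `find_inline_comments < .env` would then list the lines whose comments should move to their own line.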
