# kengz / slm-lab-benchmark

## Install for your project team

Run this command in your project directory to install the skill for your entire team:

`mkdir -p .claude/skills/slm-lab-benchmark && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/3505" && unzip -o skill.zip -d .claude/skills/slm-lab-benchmark && rm skill.zip`

## Project Skills

This skill will be saved in `.claude/skills/slm-lab-benchmark/` and checked into git. All team members will have access to it automatically.

**Important**: Please verify the skill by reviewing its instructions before using it.
Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.
## Skill Content
---
name: slm-lab-benchmark
description: Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.
---
# SLM-Lab Benchmark Skill
## Critical Rules
1. **NEVER push to remote** without explicit user permission
2. **ONLY train runs** in BENCHMARKS.md — never search results
3. **Respect Settings line** for each env (max_frame, num_envs, etc.)
4. **Use `${max_frame}` variable** in specs — never hardcode
5. **Runs must complete in <6h** (dstack max_duration)
6. **Max 10 concurrent dstack runs** — launch in batches of 10, wait for capacity/completion before launching more. Never submit all runs at once; dstack capacity is limited and mass submissions cause "no offers" failures
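A minimal capacity check before launching another batch; this is a sketch that assumes `dstack ps` prints one run per line with a header row and that finished runs show an `exited` or `failed` status, as in the examples below:
```bash
# Count runs that are still active (not exited/failed); header row skipped.
active=$(dstack ps | tail -n +2 | grep -cvE "exited|failed")
echo "active runs: $active (limit 10)"
```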
## Per-Run Intake Checklist
**Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.**
When a run completes (`dstack ps` shows `exited (0)`):
1. **Extract score**: `dstack logs NAME | grep "trial_metrics"` → get `total_reward_ma`
2. **Find HF folder name**: `dstack logs NAME 2>&1 | grep "Uploading data/"` → extract folder name from the upload log line (a combined sketch for steps 1-2 follows this checklist)
3. **Update table score** in BENCHMARKS.md
4. **Update table HF link**: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
5. **Pull HF data locally**: `source .env && huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
6. **Generate plot**: List ALL data folders for that env (`ls data/benchmark-dev/data/ | grep -i envname`), then generate with ONLY the folders matching BENCHMARKS.md entries:
```bash
uv run slm-lab plot -t "EnvName" -d data/benchmark-dev/data -f FOLDER1,FOLDER2,...
```
NOTE: `-d` sets the base data dir, `-f` takes folder names (NOT full paths).
If some folders are in `data/` (local runs) and some in `data/benchmark-dev/data/`, use `data/` as base (it has the `info/` subfolder needed for metrics).
7. **Verify plot exists** in `docs/plots/`
8. **Commit** score + link + plot together
A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
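Steps 1 and 2 can be combined into one log pass. A minimal sketch, assuming the log lines match the formats quoted in this file (`trial_metrics: ... total_reward_ma:<score> ...` and `Uploading data/FOLDER ...`); the run name is hypothetical:
```bash
NAME=ppo-pong1   # hypothetical run name; substitute the actual dstack run name
# Score: last trial_metrics line, total_reward_ma field.
score=$(dstack logs "$NAME" 2>&1 | grep "trial_metrics" | tail -1 \
  | grep -o "total_reward_ma:[^ ]*" | cut -d: -f2)
# HF folder: taken from the "Uploading data/..." log line (format assumed).
folder=$(dstack logs "$NAME" 2>&1 | grep "Uploading data/" | tail -1 \
  | grep -o "data/[^ /]*" | head -1 | cut -d/ -f2)
echo "score=$score folder=$folder"
```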
## Per-Run Graduation Checklist
**After intake, graduate each finalized run to public HF benchmark:**
1. **Upload folder to public HF**:
```bash
source .env && huggingface-cli upload SLM-Lab/benchmark data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset
```
2. **Update BENCHMARKS.md link**: Change `SLM-Lab/benchmark-dev` → `SLM-Lab/benchmark` for that entry (a sed sketch follows this checklist)
3. **Upload docs/ to public HF** (updated plots + BENCHMARKS.md):
```bash
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
```
4. **Commit** link update
5. **Push** to origin
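Step 2 as a one-liner, assuming BENCHMARKS.md uses the link format from the intake checklist and the folder name appears only in that entry's row (GNU sed; on macOS use `sed -i ''`):
```bash
FOLDER=ppo_pong_folder   # hypothetical folder name
sed -i "/${FOLDER}/s|datasets/SLM-Lab/benchmark-dev|datasets/SLM-Lab/benchmark|" BENCHMARKS.md
```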
## Launch
```bash
# Launch a run
source .env && uv run slm-lab run-remote --gpu \
-s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME
# Monitor
dstack ps # running jobs
dstack logs NAME | grep "trial_metrics" # extract score at completion
# Score = total_reward_ma from trial_metrics line
# trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...
```
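Launching a small batch over several environments can reuse the same spec. A sketch under assumptions: `ALE/Pong-v5` and `ALE/Breakout-v5` stand in for the target envs, run names are derived from the env id, and `SPEC_FILE`/`SPEC_NAME` are placeholders as above:
```bash
# Hypothetical batch launch: one run per env, staying within the 10-run limit.
source .env
for env in ALE/Pong-v5 ALE/Breakout-v5; do
  name=$(basename "$env" | tr 'A-Z' 'a-z' | tr -d '-')   # e.g. pongv5
  uv run slm-lab run-remote --gpu -s env="$env" SPEC_FILE SPEC_NAME train -n "$name"
done
```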
## Data Lifecycle
```
Remote GPU run → auto-uploads to benchmark-dev (HF)
↓ Pull to local data/
↓ Generate plots (docs/plots/)
↓ Update BENCHMARKS.md (scores, links, plots)
↓ Graduate to public benchmark (HF)
↓ Update links: benchmark-dev → benchmark
↓ Upload docs/ to public benchmark (HF)
```
### Pull Data
```bash
# Pull full dataset (fast, single request — avoids rate limits)
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset
# Or pull specific folder
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"
# KEEP this data — needed for plots AND graduation upload later
```
### Generate Plots
```bash
# Find folders for a game (check both local data/ and benchmark-dev)
ls data/ | grep -i pong
ls data/benchmark-dev/data/ | grep -i pong
# Generate comparison plot — use -d for base dir, -f for folder names only
# Use data/ as base (has info/ subfolder with trial_metrics)
uv run slm-lab plot -t "Pong-v5" -d data -f ppo_pong_folder,sac_pong_folder,crossq_pong_folder
```
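To cross-check which folders BENCHMARKS.md already references for an env (so the plot includes only those), the HF links can be parsed directly. A sketch assuming the `tree/main/data/FOLDER` link format from the intake checklist:
```bash
# Folder names referenced by BENCHMARKS.md rows mentioning "pong".
grep -i pong BENCHMARKS.md | grep -o "tree/main/data/[^)]*" | cut -d/ -f4 | sort -u
```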
### Graduate to Public HF
When a run is finalized, graduate individually from `benchmark-dev` → `benchmark`:
```bash
# Upload individual folder
source .env && huggingface-cli upload SLM-Lab/benchmark \
data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset
# Update BENCHMARKS.md link for that entry: benchmark-dev → benchmark
# Then upload docs/ (includes updated plots + BENCHMARKS.md)
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
```
| Repo | Purpose |
|------|---------|
| `SLM-Lab/benchmark-dev` | Development — noisy, iterative |
| `SLM-Lab/benchmark` | Public — finalized, validated |
## Hyperparameter Search
Run a search only when an algorithm fails to reach its target:
```bash
source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAME
```
Budget: ~3-4 trials per dimension. After search: update spec with best params, run `train`, use that result.
## Autonomous Execution
Work continuously when benchmarking. Use `sleep 300 && dstack ps` to actively wait (5 min intervals) — never delegate monitoring to background processes or scripts. Stay engaged in the conversation.
**Workflow loop** (repeat every 5-10 minutes):
1. **Check status**: `dstack ps` — identify completed/failed/running
2. **Intake completed runs**: For EACH completed run, do the full intake checklist above (score → HF link → pull → plot → table update)
3. **Launch next batch**: Up to 10 concurrent. Check capacity before launching more
4. **Iterate on failures**: Relaunch or adjust config immediately
5. **Commit progress**: Regular commits of score + link + plot updates
**Key principle**: Work continuously, check in regularly, iterate immediately on failures. Never idle. Keep reminding yourself to continue without pausing — check on tasks, update, plan, and pick up the next task immediately until all tasks are completed.
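One active-wait cycle, run as a foreground command rather than a background script; the status text is assumed to match the examples above:
```bash
# Wait 5 minutes, then check all runs.
sleep 300 && dstack ps
# Runs ready for intake (completed successfully).
dstack ps | grep "exited (0)"
```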
## Troubleshooting
- **Run interrupted**: Relaunch, increment name suffix (e.g., pong3 → pong4)
- **Low GPU usage** (<50%): CPU bottleneck or config issue
- **HF rate limit**: Download full dataset, not selective `--include` patterns
- **HF link 404**: Run didn't complete or upload failed — rerun
- **.env inline comments**: Break dstack env vars — put comments on separate lines
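An illustration of the last point, with a hypothetical variable name:
```bash
# BAD: inline comments break dstack env var parsing
HF_TOKEN=hf_xxx  # token for uploads
# GOOD: keep comments on their own lines
# token for uploads
HF_TOKEN=hf_xxx
```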