truera / trulens-evaluation-workflow

Install for your project team

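Run the command for your shell in your project directory to install the skill for your entire team. The first command is for macOS/Linux (bash); the second is the Windows PowerShell equivalent: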

mkdir -p .claude/skills/trulens-evaluation-workflow && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/1124" && unzip -o skill.zip -d .claude/skills/trulens-evaluation-workflow && rm skill.zip

New-Item -Path ".claude/skills/trulens-evaluation-workflow" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/1124" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".claude/skills/trulens-evaluation-workflow" -Force; Remove-Item "skill.zip"

Project Skills

The skill is saved in .claude/skills/trulens-evaluation-workflow/. Once you commit that directory to git, all team members get the skill when they pull the repository.

Important: Please verify the skill by reviewing its instructions before using it.

Install skill for Codex

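Run one of these commands, depending on where you want the skill installed. Each location lists the macOS/Linux (bash) command first, then the Windows PowerShell equivalent: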

Project Local ($CWD/.codex/skills)

mkdir -p .codex/skills/trulens-evaluation-workflow && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/1124" && unzip -o skill.zip -d .codex/skills/trulens-evaluation-workflow && rm skill.zip

New-Item -Path ".codex/skills/trulens-evaluation-workflow" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/1124" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".codex/skills/trulens-evaluation-workflow" -Force; Remove-Item "skill.zip"

User Global (~/.codex/skills)

mkdir -p ~/.codex/skills/trulens-evaluation-workflow && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/1124" && unzip -o skill.zip -d ~/.codex/skills/trulens-evaluation-workflow && rm skill.zip

New-Item -Path "$HOME/.codex/skills/trulens-evaluation-workflow" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/1124" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.codex/skills/trulens-evaluation-workflow" -Force; Remove-Item "skill.zip"

| Scope | Location | Suggested Use |
|-------|----------|---------------|
| REPO | `$CWD/.codex/skills` | Project directory. Teams can check in skills most relevant to a working folder here. |
| REPO | `$CWD/../.codex/skills` | A folder above CWD. Organizations can check in skills relevant to a shared area. |
| REPO | `$REPO_ROOT/.codex/skills` | Top-most root folder. Relevant to everyone using the repository. |
| USER | `$CODEX_HOME/skills` | Personal folder (`~/.codex/skills`). Curate skills that apply to any repository. |

Install skill for GitHub Copilot

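Run one of these commands, depending on where you want the skill installed. Each location lists the macOS/Linux (bash) command first, then the Windows PowerShell equivalent: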

Project (.github/skills)

mkdir -p .github/skills/trulens-evaluation-workflow && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/1124" && unzip -o skill.zip -d .github/skills/trulens-evaluation-workflow && rm skill.zip

New-Item -Path ".github/skills/trulens-evaluation-workflow" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/1124" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".github/skills/trulens-evaluation-workflow" -Force; Remove-Item "skill.zip"

Personal (~/.copilot/skills)

mkdir -p ~/.copilot/skills/trulens-evaluation-workflow && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/1124" && unzip -o skill.zip -d ~/.copilot/skills/trulens-evaluation-workflow && rm skill.zip

New-Item -Path "$HOME/.copilot/skills/trulens-evaluation-workflow" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/1124" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.copilot/skills/trulens-evaluation-workflow" -Force; Remove-Item "skill.zip"

| Scope | Location | Suggested Use |
|-------|----------|---------------|
| Project | `.github/skills/` | Repository-specific skills. Checked into git for the whole team. |
| Personal | `~/.copilot/skills/` | Personal skills available across all your projects. |

Install skill for Google Antigravity

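Run one of these commands, depending on where you want the skill installed. Each location lists the macOS/Linux (bash) command first, then the Windows PowerShell equivalent: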

Workspace (.agent/skills)

mkdir -p .agent/skills/trulens-evaluation-workflow && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/1124" && unzip -o skill.zip -d .agent/skills/trulens-evaluation-workflow && rm skill.zip

New-Item -Path ".agent/skills/trulens-evaluation-workflow" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/1124" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".agent/skills/trulens-evaluation-workflow" -Force; Remove-Item "skill.zip"

Global (~/.gemini/antigravity/skills)

mkdir -p ~/.gemini/antigravity/skills/trulens-evaluation-workflow && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/1124" && unzip -o skill.zip -d ~/.gemini/antigravity/skills/trulens-evaluation-workflow && rm skill.zip

New-Item -Path "$HOME/.gemini/antigravity/skills/trulens-evaluation-workflow" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/1124" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.gemini/antigravity/skills/trulens-evaluation-workflow" -Force; Remove-Item "skill.zip"

| Scope | Location | Suggested Use |
|-------|----------|---------------|
| Workspace | `.agent/skills/` | Workspace-specific skills for project workflows and conventions. |
| Global | `~/.gemini/antigravity/skills/` | Personal skills available across all workspaces. |

Systematically evaluate your LLM application with TruLens

Source: https://github.com/truera/trulens/tree/main/skills

Skill Content

---
skill_spec_version: 0.1.0
name: trulens-evaluation-workflow
version: 1.0.0
description: Systematically evaluate your LLM application with TruLens
tags: [trulens, llm, evaluation, workflow, orchestration]
---

# TruLens Evaluation Workflow

A systematic approach to evaluating your LLM application.

## When to Use This Skill

Use this skill when you want to:
- Set up comprehensive evaluation for a new LLM app
- Improve an existing app's evaluation coverage
- Understand the full TruLens workflow
- Know which sub-skill to use for your current task

## Required Questions to Ask User

**Before implementing, always ask the user these questions:**

### 1. App Type (determines instrumentation wrapper)
- What framework is your app built with? (LangChain, LangGraph/Deep Agents, LlamaIndex, Custom)

### 2. Evaluation Metrics (determines feedback functions)
Ask: **"Which evaluation metrics would you like to use?"**

| App Type | Recommended Metrics | Description |
|----------|--------------------| ------------|
| **RAG** | RAG Triad | Context Relevance, Groundedness, Answer Relevance |
| **Agent** | Agent GPA | Tool Selection, Tool Calling, Execution Efficiency, etc. |
| **Simple** | Answer Relevance | Basic input-to-output relevance check |
| **Custom** | Ask user | Let user describe what they want to evaluate |

**For Agents, also ask:**
- Does your agent do explicit planning? (determines if Plan Quality/Adherence metrics apply)

### 3. Additional Metrics (optional)
- Do you want any additional evaluations? (Coherence, Conciseness, Harmlessness, custom metrics)
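
The three questions above can be folded into a small helper that turns the user's answers into a metric list. This is a minimal sketch, not part of TruLens: `UserAnswers` and `select_metrics` are hypothetical names, the metric names are plain strings copied from the table above, and only the Agent metrics that the table names explicitly are included.

```python
from dataclasses import dataclass, field
from typing import List

# Recommended metrics per app type, copied from the table above.
# The Agent row lists only the metrics the table names explicitly.
RECOMMENDED_METRICS = {
    "rag": ["Context Relevance", "Groundedness", "Answer Relevance"],
    "agent": ["Tool Selection", "Tool Calling", "Execution Efficiency"],
    "simple": ["Answer Relevance"],
}


@dataclass
class UserAnswers:
    """Hypothetical container for the user's replies to the three questions."""
    framework: str                      # question 1
    app_type: str                       # question 2: RAG, Agent, Simple, or Custom
    explicit_planning: bool = False     # follow-up question for agents
    additional_metrics: List[str] = field(default_factory=list)  # question 3
    custom_metrics: List[str] = field(default_factory=list)      # for Custom apps


def select_metrics(answers: UserAnswers) -> List[str]:
    """Return the metric names to configure, in a stable order."""
    app_type = answers.app_type.strip().lower()
    if app_type == "custom":
        # Custom apps have no default: use whatever the user described.
        metrics = list(answers.custom_metrics)
    elif app_type in RECOMMENDED_METRICS:
        metrics = list(RECOMMENDED_METRICS[app_type])
    else:
        raise ValueError(f"Unknown app type: {answers.app_type!r}")

    # Plan metrics only apply to agents that plan explicitly.
    if app_type == "agent" and answers.explicit_planning:
        metrics += ["Plan Quality", "Plan Adherence"]

    # Append optional extras without duplicating anything already chosen.
    for name in answers.additional_metrics:
        if name not in metrics:
            metrics.append(name)
    return metrics
```

For example, `select_metrics(UserAnswers("LangGraph", "Agent", explicit_planning=True))` returns the three listed Agent metrics followed by the two plan metrics.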

## The Evaluation Workflow

```
┌─────────────────────────────────────────────────────────────────┐
│                    TruLens Evaluation Workflow                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   1. INSTRUMENT          2. CURATE            3. CONFIGURE       │
│   ┌──────────────┐      ┌──────────────┐     ┌──────────────┐   │
│   │ Capture data │  →   │ Build test   │  →  │ Choose       │   │
│   │ from your    │      │ datasets     │     │ metrics      │   │
│   │ app          │      │              │     │              │   │
│   └──────────────┘      └──────────────┘     └──────────────┘   │
│         ↓                                           ↓            │
│         └─────────────────────┬─────────────────────┘            │
│                               ↓                                  │
│                      4. RUN & ANALYZE                            │
│                      ┌──────────────┐                            │
│                      │ Execute evals│                            │
│                      │ & iterate    │                            │
│                      └──────────────┘                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

## Sub-Skills Reference

| Step | Skill | When to Use |
|------|-------|-------------|
| **1. Instrument** | `instrumentation/` | Setting up a new app, adding custom spans, capturing specific data for evals |
| **2. Curate** | `dataset-curation/` | Creating test datasets, storing ground truth, ingesting external logs |
| **3. Configure** | `evaluation-setup/` | Choosing metrics (RAG triad vs Agent GPA), setting up feedback functions |
| **4. Run** | `running-evaluations/` | Executing evaluations, viewing results, comparing versions |

## Interactive Workflow Guide

**Answer these questions to find where to start:**

### Where are you in the process?

**"I have a new LLM app that isn't instrumented yet"**
→ Start with `instrumentation/` skill

**"My app is instrumented but I don't have test data"**
→ Go to `dataset-curation/` skill

**"I have data but haven't set up evaluations"**
→ Go to `evaluation-setup/` skill

**"Everything is set up, I just need to run evals"**
→ Go to `running-evaluations/` skill
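
The four statements above form a simple decision chain: the first missing prerequisite determines where to start. As a rough sketch (the function name and its boolean flags are hypothetical, chosen only to mirror the statements), this could look like:

```python
def next_skill(instrumented: bool, has_test_data: bool, evals_configured: bool) -> str:
    """Return the sub-skill directory to start with, given the current state.

    The checks run in workflow order, so the first gap found wins.
    """
    if not instrumented:
        return "instrumentation/"
    if not has_test_data:
        return "dataset-curation/"
    if not evals_configured:
        return "evaluation-setup/"
    return "running-evaluations/"
```

This mirrors the guide literally, so a lack of test data always routes to `dataset-curation/`. A caller who does not need ground truth comparisons could simply pass `has_test_data=True` to skip that step.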

---

### What's your immediate goal?

**"I want to see traces of my app's execution"**
→ Use `instrumentation/` - capture spans and view in dashboard

**"I want to evaluate my RAG's retrieval quality"**
→ Use `evaluation-setup/` - configure RAG Triad metrics

**"I want to evaluate my agent's tool usage"**
→ Use `evaluation-setup/` - configure Agent GPA metrics

**"I want to compare two versions of my app"**
→ Use `running-evaluations/` - version comparison pattern

**"I want to evaluate against known correct answers"**
→ Use `dataset-curation/` - create ground truth dataset

---

## Quick Start Paths

### Path A: Evaluate a RAG App

1. **Instrument** → Wrap with `TruLlama` (LlamaIndex) or `TruChain` (LangChain)
2. **Configure** → Set up RAG Triad (context relevance, groundedness, answer relevance)
3. **Run** → Execute queries and view leaderboard

### Path B: Evaluate an Agent

1. **Instrument** → Wrap with `TruGraph` (for LangGraph/Deep Agents)
2. **Configure** → Set up Agent GPA metrics (or Answer Relevance for simple evals)
3. **Run** → Execute tasks and analyze traces

**Note**: For LangGraph-based frameworks like Deep Agents, always use `TruGraph` rather than manual `@instrument()` decorators. TruGraph automatically creates the correct span types and captures all graph transitions.
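
The wrapper choice in Paths A and B depends only on the framework, so it can be expressed as a lookup. The sketch below is illustrative: `choose_wrapper` is a hypothetical helper that returns the wrapper's class name as a string rather than importing TruLens, and it returns `None` for frameworks this guide names no wrapper for (such as Custom).

```python
from typing import Optional

# Framework name (lowercased) -> TruLens wrapper class name mentioned in this guide.
WRAPPER_BY_FRAMEWORK = {
    "langchain": "TruChain",
    "llamaindex": "TruLlama",
    "langgraph": "TruGraph",
    "deep agents": "TruGraph",  # Deep Agents is LangGraph-based
}


def choose_wrapper(framework: str) -> Optional[str]:
    """Return the wrapper name for a framework, or None if this guide names none."""
    return WRAPPER_BY_FRAMEWORK.get(framework.strip().lower())
```

Returning a name instead of a class keeps the sketch runnable without TruLens installed; real code would import the chosen wrapper from the TruLens package that provides it.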

### Path C: Regression Testing

1. **Curate** → Create ground truth test dataset
2. **Configure** → Add ground truth agreement metric
3. **Run** → Compare versions against test set

### Path D: Production Monitoring

1. **Instrument** → Add custom attributes for key data
2. **Configure** → Set up metrics for production concerns
3. **Run** → Continuously evaluate production traffic

---

## Common Questions

**"Do I need to use all four skills?"**
Not necessarily. `instrumentation/` and `evaluation-setup/` are essential, and `running-evaluations/` is needed to execute evaluations and view results. Only `dataset-curation/` is optional; use it when you want ground truth comparisons.

**"What order should I use them?"**
Generally: Instrument → (optionally) Curate → Configure → Run. But you can revisit any step as needed.

**"Can I add more evaluations later?"**
Yes. You can always add new feedback functions and re-run evaluations on existing traces.

**"How do I know if my app is a RAG or Agent?"**
- **RAG**: Retrieves documents/context, generates grounded responses
- **Agent**: Uses tools, makes decisions, may involve planning

If your app does both (e.g., agentic RAG), use metrics from both categories.
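
The RAG/Agent distinction above can be reduced to two yes/no questions, with both metric families applying when both answers are yes. The following is a simplified sketch under that assumption: `metric_families` is a hypothetical helper, and treating "uses tools" as the sole signal for an agent is a simplification of the description above.

```python
from typing import List


def metric_families(retrieves_context: bool, uses_tools: bool) -> List[str]:
    """Return the metric families that apply to an app.

    An agentic RAG app (both flags True) gets both families.
    An app with neither trait falls back to basic Answer Relevance.
    """
    families = []
    if retrieves_context:
        families.append("RAG Triad")
    if uses_tools:
        families.append("Agent GPA")
    if not families:
        families.append("Answer Relevance")
    return families
```

An app that both retrieves context and calls tools yields `["RAG Triad", "Agent GPA"]`, matching the advice to combine metrics from both categories.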

---

## Getting Help

If you're unsure which skill to use, describe your goal and I'll guide you to the right one.

## Known Compatibility Notes

### Deep Agents / LangGraph

- **Always use `TruGraph`** for LangGraph-based apps (including Deep Agents)
- The `.on_input()` and `.on_output()` feedback shortcuts require `RECORD_ROOT` spans
- Framework wrappers (TruGraph, TruChain) create these automatically
- Manual `@instrument(span_type=SpanType.AGENT)` will NOT work with selector shortcuts

### Pydantic Compatibility

Some LangGraph/Deep Agents versions use `NotRequired` type annotations that older Pydantic versions can't handle. If you see `PydanticForbiddenQualifier` errors, update to the latest TruLens version.
