wshobson / distributed-tracing

Install for your project team

Run this command in your project directory to install the skill for your entire team:

mkdir -p .claude/skills/distributed-tracing && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/454" && unzip -o skill.zip -d .claude/skills/distributed-tracing && rm skill.zip

New-Item -Path ".claude/skills/distributed-tracing" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/454" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".claude/skills/distributed-tracing" -Force; Remove-Item "skill.zip"

Project Skills

This skill will be saved in .claude/skills/distributed-tracing/ and checked into git. All team members will have access to it automatically.

Important: Please verify the skill by reviewing its instructions before using it.

Install skill for Codex

Run one of these commands to install the skill depending on your needs:

Project Local ($CWD/.codex/skills)

mkdir -p .codex/skills/distributed-tracing && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/454" && unzip -o skill.zip -d .codex/skills/distributed-tracing && rm skill.zip

New-Item -Path ".codex/skills/distributed-tracing" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/454" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".codex/skills/distributed-tracing" -Force; Remove-Item "skill.zip"

User Global (~/.codex/skills)

mkdir -p ~/.codex/skills/distributed-tracing && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/454" && unzip -o skill.zip -d ~/.codex/skills/distributed-tracing && rm skill.zip

New-Item -Path "$HOME/.codex/skills/distributed-tracing" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/454" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.codex/skills/distributed-tracing" -Force; Remove-Item "skill.zip"

Scope	Location	Suggested Use
REPO	`$CWD/.codex/skills`	Project directory. Teams can check in skills most relevant to a working folder here.
REPO	`$CWD/../.codex/skills`	A folder above CWD. Organizations can check in skills relevant to a shared area.
REPO	`$REPO_ROOT/.codex/skills`	Top-most root folder. Relevant to everyone using the repository.
USER	`$CODEX_HOME/skills`	Personal folder (`~/.codex/skills`). Curate skills that apply to any repository.

Install skill for GitHub Copilot

Run one of these commands to install the skill depending on your needs:

Project (.github/skills)

mkdir -p .github/skills/distributed-tracing && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/454" && unzip -o skill.zip -d .github/skills/distributed-tracing && rm skill.zip

New-Item -Path ".github/skills/distributed-tracing" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/454" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".github/skills/distributed-tracing" -Force; Remove-Item "skill.zip"

Personal (~/.copilot/skills)

mkdir -p ~/.copilot/skills/distributed-tracing && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/454" && unzip -o skill.zip -d ~/.copilot/skills/distributed-tracing && rm skill.zip

New-Item -Path "$HOME/.copilot/skills/distributed-tracing" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/454" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.copilot/skills/distributed-tracing" -Force; Remove-Item "skill.zip"

Scope	Location	Suggested Use
Project	`.github/skills/`	Repository-specific skills. Checked into git for the whole team.
Personal	`~/.copilot/skills/`	Personal skills available across all your projects.

Install skill for Google Antigravity

Run one of these commands to install the skill depending on your needs:

Workspace (.agent/skills)

mkdir -p .agent/skills/distributed-tracing && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/454" && unzip -o skill.zip -d .agent/skills/distributed-tracing && rm skill.zip

New-Item -Path ".agent/skills/distributed-tracing" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/454" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath ".agent/skills/distributed-tracing" -Force; Remove-Item "skill.zip"

Global (~/.gemini/antigravity/skills)

mkdir -p ~/.gemini/antigravity/skills/distributed-tracing && curl -L -o skill.zip "https://fastmcp.me/Skills/Download/454" && unzip -o skill.zip -d ~/.gemini/antigravity/skills/distributed-tracing && rm skill.zip

New-Item -Path "$HOME/.gemini/antigravity/skills/distributed-tracing" -ItemType Directory -Force; Invoke-WebRequest -Uri "https://fastmcp.me/Skills/Download/454" -OutFile "skill.zip"; Expand-Archive -Path "skill.zip" -DestinationPath "$HOME/.gemini/antigravity/skills/distributed-tracing" -Force; Remove-Item "skill.zip"

Scope	Location	Suggested Use
Workspace	`.agent/skills/`	Workspace-specific skills for project workflows and conventions.
Global	`~/.gemini/antigravity/skills/`	Personal skills available across all workspaces.

Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.

Coding

39 views

0 installs

Tools I Recommend

ClaudeKit

Sponsor

Production-ready AI subagents automate your development & marketing workflows. Build in hours, not weeks.

Source: https://github.com/wshobson/agents/tree/main/plugins/observability-monitoring/skills/distributed-tracing

Skill Content

---
name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
---

# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## When to Use

- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths

## Distributed Tracing Concepts

### Trace Structure
```
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```

### Key Components
- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering
- **Logs** - Timestamped events within a span

## Jaeger Setup

### Kubernetes Deployment

```bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```

### Docker Compose

```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "14250:14250"  # gRPC
      - "9411:9411"    # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```

**Reference:** See `references/jaeger-setup.md`

## Application Instrumentation

### OpenTelemetry (Recommended)

#### Python (Flask)
```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()
```

#### Node.js (Express)
```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: { attributes: { 'service.name': 'my-service' } }
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');

  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
```

#### Go
```go
package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "get_users")
    defer span.End()

    span.SetAttributes(attribute.String("user.filter", "active"))

    users, err := fetchUsersFromDB(ctx)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("user.count", len(users)))
    return users, nil
}
```

**Reference:** See `references/instrumentation.md`

## Context Propagation

### HTTP Headers
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```

### Propagation in HTTP Requests

#### Python
```python
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context

response = requests.get('http://downstream-service/api', headers=headers)
```

#### Node.js
```javascript
const { propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);

axios.get('http://downstream-service/api', { headers });
```

## Tempo Setup (Grafana)

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: tempo
        image: grafana/tempo:latest
        args:
          - -config.file=/etc/tempo/tempo.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/tempo
      volumes:
      - name: config
        configMap:
          name: tempo-config
```

**Reference:** See `assets/jaeger-config.yaml.template`

## Sampling Strategies

### Probabilistic Sampling
```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```

### Rate Limiting Sampling
```yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```

### Adaptive Sampling
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on trace ID (deterministic)
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```

## Trace Analysis

### Finding Slow Requests

**Jaeger Query:**
```
service=my-service
duration > 1s
```

### Finding Errors

**Jaeger Query:**
```
service=my-service
error=true
tags.http.status_code >= 500
```

### Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards

## Integration with Logging

### Correlated Logs
```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```

## Troubleshooting

**No traces appearing:**
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs

**High latency overhead:**
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration

## Reference Files

- `references/jaeger-setup.md` - Jaeger installation
- `references/instrumentation.md` - Instrumentation patterns
- `assets/jaeger-config.yaml.template` - Jaeger configuration

## Related Skills

- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs

wshobson / distributed-tracing

Install for your project team

Download skill

Enable skills in Claude

Upload to Claude

Install skill for Codex

Install skill for GitHub Copilot

Install skill for Google Antigravity

Tools I Recommend

ClaudeKit

Skill Content