ArtificialAnalysis / data-analysis

High-performance data analysis using Polars - load, transform, aggregate, visualize and export tabular data. Use for CSV/JSON/Parquet processing, statistical analysis, time series, and creating charts.

0 views
0 installs

Skill Content

---
name: data_analysis
description: High-performance data analysis using Polars - load, transform, aggregate, visualize and export tabular data. Use for CSV/JSON/Parquet processing, statistical analysis, time series, and creating charts.
---

# Data Analysis Skill

Comprehensive data analysis toolkit using **Polars** - a blazingly fast DataFrame library. This skill provides instructions, reference documentation, and ready-to-use scripts for common data analysis tasks.

### Iteration Checkpoints

| Step | What to Present | User Input Type |
|------|-----------------|-----------------|
| Data Loading | Shape, columns, sample rows | "Is this the right data?" |
| Data Exploration | Summary stats, data quality issues | "Any columns to focus on?" |
| Transformation | Before/after comparison | "Does this transformation look correct?" |
| Analysis | Key findings, charts | "Should I dig deeper into anything?" |
| Export | Output preview | "Ready to save, or any changes?" |

## Quick Start

```python
import polars as pl
from polars import col

# Load data
df = pl.read_csv("data.csv")

# Explore
print(df.shape, df.schema)
df.describe()

# Transform and analyze
result = (
    df.filter(col("value") > 0)
    .group_by("category")
    .agg(col("value").sum().alias("total"))
    .sort("total", descending=True)
)

# Export
result.write_csv("output.csv")
```

## When to Use This Skill

- Loading datasets (CSV, JSON, Parquet, Excel, databases)
- Data cleaning, filtering, and transformation
- Aggregations, grouping, and pivot tables
- Statistical analysis and summary statistics
- Time series analysis and resampling
- Joining and merging multiple datasets
- Creating visualizations and charts
- Exporting results to various formats

## Skill Contents

### Reference Documentation

Detailed API reference and patterns for specific operations:

- `reference/loading.md` - Loading data from all supported formats
- `reference/transformations.md` - Column operations, filtering, sorting, type casting
- `reference/aggregations.md` - Group by, window functions, running totals
- `reference/time_series.md` - Date parsing, resampling, lag features
- `reference/statistics.md` - Correlations, distributions, hypothesis testing setup
- `reference/visualization.md` - Creating charts with matplotlib/plotly

### Ready-to-Use Scripts

Executable Python scripts for common tasks:

- `scripts/explore_data.py` - Quick dataset exploration and profiling
- `scripts/summary_stats.py` - Generate comprehensive statistics report

## Core Patterns

### Loading Data

```python
# CSV (most common)
df = pl.read_csv("data.csv")

# Lazy loading for large files
df = pl.scan_csv("large.csv").filter(col("x") > 0).collect()

# Parquet (recommended for large datasets)
df = pl.read_parquet("data.parquet")

# JSON
df = pl.read_json("data.json")
df = pl.read_ndjson("data.ndjson")  # Newline-delimited
```

### Filtering and Selection

```python
# Select columns
df.select("col1", "col2")
df.select(col("name"), col("value") * 2)

# Filter rows
df.filter(col("age") > 25)
df.filter((col("status") == "active") & (col("value") > 100))
df.filter(col("name").str.contains("Smith"))
```

### Transformations

```python
# Add/modify columns
df = df.with_columns(
    (col("price") * col("qty")).alias("total"),
    col("date_str").str.to_date("%Y-%m-%d").alias("date"),
)

# Conditional values
df = df.with_columns(
    pl.when(col("score") >= 90).then(pl.lit("A"))
    .when(col("score") >= 80).then(pl.lit("B"))
    .otherwise(pl.lit("C"))
    .alias("grade")
)
```

### Aggregations

```python
# Group by
df.group_by("category").agg(
    col("value").sum().alias("total"),
    col("value").mean().alias("avg"),
    pl.len().alias("count"),
)

# Window functions
df.with_columns(
    col("value").sum().over("group").alias("group_total"),
    col("value").rank().over("group").alias("rank_in_group"),
)
```

### Exporting

```python
df.write_csv("output.csv")
df.write_parquet("output.parquet")
df.write_json("output.json", row_oriented=True)
```

## Best Practices

1. **Use lazy evaluation** for large datasets: `pl.scan_csv()` + `.collect()`
2. **Filter early** to reduce data volume before expensive operations
3. **Select only needed columns** to minimize memory usage
4. **Prefer Parquet** for storage - faster I/O, better compression
5. **Use `.explain()`** to understand and optimize query plans