---
name: data-analyst
description: A data analyst skill that guides Claude through structured, professional data analysis workflows. Use this skill when the user requests data analysis work including analyze, query, dashboard, metrics, EDA, cohort, funnel, or A/B testing.
---

# CrushData AI - Data Analyst Skill

A data analyst intelligence skill that guides you through structured, professional data analysis workflows.

## How to Use This Skill

When the user requests data analysis work (analyze, query, dashboard, metrics, EDA, cohort, funnel, A/B test), follow this workflow:

### Step 1: Discovery Protocol (MANDATORY)

**Before writing any code, ask the user:**

```
## Discovery Questions

1. **Business Context**
   - What business question should this analysis answer?
   - Who is the audience? (Executive, Analyst, Engineer)
   - What action will this analysis inform?

2. **Data Context**
   - Which tables/databases contain the relevant data?
   - What time range should I analyze?
   - Any known data quality issues?

3. **Metric Definitions**
   - How does YOUR company define the key metrics?
   - Any filters to apply? (exclude test users, internal accounts?)
   - What timezone should I use for dates?

4. **Script Organization**
   - Save all analysis scripts in an `analysis/` folder.
   - Create this folder if it does not exist.

5. **Python Environment**
   - Check for `venv` or `.venv`. If missing, `python3 -m venv venv`.
   - Install dependencies in venv (`venv/bin/pip install`).
   - Run scripts using venv python (`venv/bin/python`).

6. **Reports**
   - Save all profiling, validation, and findings to `reports/` folder.
   - Create this folder if it does not exist.
```
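The environment convention above can be sketched as shell (a minimal sketch; the package list and script name are examples, not requirements):

```shell
# Create the project venv only if neither venv/ nor .venv/ exists yet
[ -d venv ] || [ -d .venv ] || python3 -m venv venv

# Then install dependencies and run scripts through the venv, e.g.:
#   venv/bin/pip install pandas python-dotenv
#   venv/bin/python analysis/your_script.py
```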

### Step 1b: Secure Data Access

> **Credentials are stored in the project's `.env` file** - never hardcoded.

- **Check Connections**: Run `npx crushdataai connections` first.
- **Missing Data?**: If the data source is not listed, **INSTRUCT** the user to run:
  `npx crushdataai connect`
- **Discover Schema**: Use `npx crushdataai schema <connection> [table]` to see available tables and columns.
- **Get Code**: **ALWAYS** use `npx crushdataai snippet <name>` to get loading code (uses env vars).
- **Load .env**: Scripts use `os.environ["VAR"]`. Ensure `python-dotenv` is installed and `.env` is loaded.
- **Security**: **DO NOT** ask user to copy/move files to `data/`. Treat connected data as read-only.
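In scripts, `from dotenv import load_dotenv; load_dotenv()` handles the loading. For illustration, a simplified stdlib-only sketch of what it does (real python-dotenv also handles quoting, `export` prefixes, and interpolation):

```python
# Simplified sketch of python-dotenv's load_dotenv(): copy KEY=VALUE pairs
# from .env into os.environ without overwriting variables already set.
import os

def load_env(path=".env"):
    try:
        content = open(path).read()
    except FileNotFoundError:
        return  # no .env file; rely on the existing environment
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
    # keep first-seen values: do not overwrite variables already set
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))

load_env()
# database_url = os.environ["DATABASE_URL"]  # hypothetical variable name
```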

### Step 2: Search Relevant Domains

Use `search.py` to gather comprehensive information:

```bash
python3 .claude/skills/data-analyst/scripts/search.py "<query>" --domain <domain> [-n 3]
```

**Available domains:**
| Domain | Use Case |
|--------|----------|
| `workflow` | Step-by-step analysis process |
| `metric` | Metric definitions and formulas |
| `chart` | Visualization recommendations |
| `cleaning` | Data quality patterns |
| `sql` | SQL patterns (window functions, cohorts) |
| `python` | pandas/polars code snippets |
| `database` | PostgreSQL, BigQuery, Snowflake tips |
| `report` | Dashboard UX guidelines |
| `validation` | Common mistakes to avoid |

**Industry-specific search:**
```bash
python3 .claude/skills/data-analyst/scripts/search.py "<query>" --industry saas|ecommerce|finance|marketing
```

**Recommended search order:**
1. `workflow` - Get the step-by-step process for this analysis type
2. `metric` or `--industry` - Get relevant metric definitions
3. `sql` or `python` - Get code patterns for implementation
4. `chart` - Get visualization recommendations
5. `validation` - Check for common mistakes to avoid

### Step 3: Data Profiling (MANDATORY Before Analysis)

Before any analysis, run profiling:

**Python:**
```python
# df is the DataFrame loaded via the snippet from Step 1b
print(f"Shape: {df.shape}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Sample:\n{df.sample(5)}")
```

**SQL:**
```sql
SELECT 
    COUNT(*) as total_rows,
    COUNT(DISTINCT user_id) as unique_users,
    MIN(date) as min_date,
    MAX(date) as max_date
FROM table;
```

**Report findings to user before proceeding:**
> "I found X rows, Y unique users, date range from A to B. Does this match your expectation?"

### Step 3b: Data Cleaning & Transformation (ETL)

**Address data quality issues found in Step 3:**
1. **Cleaning**: Handle missing values, remove invalid duplicates, fix types.
2. **Transformation**: Standardize categories, parse dates, normalize text.
3. **Feature Engineering**: Create calculated columns needed for metrics.

*Tip: Save cleaning scripts to `etl/` folder. Save processed data to `data/processed/`.*
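The three steps above, sketched with plain Python dicts (a minimal illustration; in a real pipeline you would do the equivalent with pandas, and the field names here are hypothetical):

```python
from datetime import datetime

raw = [
    {"user_id": "1", "signup": "2024-01-05", "plan": " Pro "},
    {"user_id": "1", "signup": "2024-01-05", "plan": " Pro "},  # exact duplicate
    {"user_id": "2", "signup": "2024-02-10", "plan": "basic"},
    {"user_id": None, "signup": "2024-03-01", "plan": "Basic"},  # missing key field
]

cleaned, seen = [], set()
for row in raw:
    if row["user_id"] is None:          # 1. Cleaning: drop rows missing a key field
        continue
    key = (row["user_id"], row["signup"])
    if key in seen:                     # 1. Cleaning: remove invalid duplicates
        continue
    seen.add(key)
    cleaned.append({
        "user_id": int(row["user_id"]),                                  # 1. fix types
        "signup": datetime.strptime(row["signup"], "%Y-%m-%d").date(),   # 2. parse dates
        "plan": row["plan"].strip().lower(),                             # 2. standardize categories
        "signup_month": row["signup"][:7],                               # 3. feature engineering
    })
```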

### Step 4: Execute Analysis with Validation

**Before JOINs:**
- Run the join on a small sample (e.g. 100 rows) first
- Check: Did the row count change unexpectedly? (fan-out from duplicate keys is the usual culprit)
- Ask: "The join produced X rows from Y input rows. Is that expected?"

**Before Aggregations:**
- Check for duplicates that could inflate sums
- Verify granularity: "Is this one row per user per day?"
- Ask: "Total = $X. Does this seem reasonable?"

**Before Delivery:**
- Sanity check order of magnitude
- Compare to benchmark or prior period
- Present for user validation before finalizing
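The duplicate and sanity checks above, as a minimal sketch (plain Python on a toy example; with pandas you would use `df.duplicated()` and a groupby instead):

```python
# Toy example: one duplicated (user_id, day) row that would inflate a revenue sum
rows = [
    {"user_id": 1, "day": "2024-01-01", "amount": 10},
    {"user_id": 1, "day": "2024-01-01", "amount": 10},  # duplicate
    {"user_id": 2, "day": "2024-01-01", "amount": 25},
]

# Granularity check: is this really one row per user per day?
keys = [(r["user_id"], r["day"]) for r in rows]
n_dupes = len(keys) - len(set(keys))

total = sum(r["amount"] for r in rows)
print(f"Found {n_dupes} duplicate user/day rows. Total = ${total}. Does this seem reasonable?")
```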

---

## Workflow Reference

| Analysis Type | Search Command |
|--------------|----------------|
| EDA | `search.py "exploratory data analysis" --domain workflow` |
| Dashboard | `search.py "dashboard creation" --domain workflow` |
| A/B Test | `search.py "ab test" --domain workflow` |
| Cohort | `search.py "cohort analysis" --domain workflow` |
| Funnel | `search.py "funnel analysis" --domain workflow` |
| Time Series | `search.py "time series" --domain workflow` |
| Segmentation | `search.py "customer segmentation" --domain workflow` |
| Data Cleaning | `search.py "data cleaning" --domain workflow` |

---

## Common Rules

1. **Always ask before assuming** - Metric definitions vary by company
2. **Profile data first** - Never aggregate without understanding the data
3. **Validate results** - Check totals, compare to benchmarks
4. **Document assumptions** - State what filters and definitions you used
5. **Show your work** - Explain the logic behind complex queries

---

## Step 5: Generate Dashboard Output

**After analysis, save results for dashboard visualization:**

1. **Search for best chart type:**
   ```bash
   python3 .claude/skills/data-analyst/scripts/search.py "<metric description>" --domain chart
   ```

2. **Create dashboard JSON:**
   ```python
   from pathlib import Path
   import json
   from datetime import datetime
   
   Path("reports/dashboards").mkdir(parents=True, exist_ok=True)
   
   dashboard = {
       "metadata": {
           "title": "Analysis Dashboard",
           "generatedAt": datetime.now().isoformat(),
           "dataRange": f"{start_date} to {end_date}"
       },
       "kpis": [
           {"id": "kpi-1", "label": "Total Revenue", "value": "$50,000", "trend": "+12%", "trendDirection": "up"}
       ],
       "charts": [
           {
               "id": "chart-1",
               "type": "line",  # line, bar, pie, area, scatter, donut, table
               "title": "Monthly Trend",
               "data": {
                   "labels": ["Jan", "Feb", "Mar"],
                   "datasets": [{"label": "Revenue", "values": [10000, 15000, 25000]}]
               }
           }
       ]
   }
   
   with open("reports/dashboards/dashboard.json", "w") as f:
       json.dump(dashboard, f, indent=2)
   ```

### 5b. Making Charts Refreshable (Recommended)

To allow the user to refresh data directly from the dashboard:
1. Include a `query` object in the chart definition.
2. Run `npx crushdataai connections` to list available connection names (this is secure - no passwords shown).
3. Set `connection` to one of the listed names.
4. Set `sql` to the query used to generate the data.

> **SECURITY**: Never read `.env` directly to find connection names. Always use `npx crushdataai connections`.

```json
"query": {
    "connection": "my_postgres_db",
    "sql": "SELECT date, revenue FROM sales WHERE..."
}
```

**Database Specifics:**
- **SQL/Databases**: Provide the full SQL query.
- **Shopify**: Provide the resource name (e.g. `orders`).
- **CSV**: Provide the connection name. The `sql` field is ignored but must be present (set it to `"default"`).
- **MongoDB**: Provide the collection name in the `sql` field.

**Script-Based Refresh (for Python-aggregated charts):**

If your chart requires Python aggregation (e.g., grouping, custom calculations), use `script` instead of `connection`:

```json
"query": {
    "script": "analysis/my_dashboard_script.py"
}
```

When the user clicks Refresh, the CLI will **re-run your Python script**. The script should update the dashboard JSON file with fresh aggregated data.

> **TIP**: Use `script` for Shopify/API charts that need aggregation. Use `connection` + `sql` only for SQL databases where the query returns pre-formatted chart data.
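A script-based refresh target might look like the following sketch (the file name matches the hypothetical path above; the chart index and placeholder numbers are illustrative, and the CLI simply re-runs the file so it must rewrite the dashboard JSON itself):

```python
# analysis/my_dashboard_script.py (hypothetical): recompute aggregates and
# rewrite the dashboard JSON in place so Refresh shows fresh data.
import json
from datetime import datetime
from pathlib import Path

DASHBOARD = Path("reports/dashboards/dashboard.json")

def refresh():
    dashboard = json.loads(DASHBOARD.read_text())
    # Recompute aggregates here (API calls, groupbys, etc.); placeholder values:
    monthly_revenue = {"Jan": 10000, "Feb": 15000, "Mar": 25000}
    dashboard["charts"][0]["data"] = {
        "labels": list(monthly_revenue),
        "datasets": [{"label": "Revenue", "values": list(monthly_revenue.values())}],
    }
    dashboard["metadata"]["generatedAt"] = datetime.now().isoformat()
    DASHBOARD.write_text(json.dumps(dashboard, indent=2))

# The CLI runs this file top to bottom, so the real script ends with: refresh()
```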

**Finally, tell the user:**
> "Dashboard ready! Run `npx crushdataai dashboard` to view."

---

## Pre-Delivery Checklist

Before presenting final results:

- [ ] Confirmed business question is answered
- [ ] Data was profiled and validated
- [ ] Metric definitions match user's expectations
- [ ] Sanity checks pass (order of magnitude, trends, etc.)
- [ ] Visualizations follow best practices (search `--domain chart`)
- [ ] Assumptions and filters are documented
