---
name: duckdb-analytics-in-node
description: Use DuckDB inside Node.js for analytical queries against CSV/Parquet/JSON without spinning up a warehouse. Use when you need to aggregate, join, or window-function over millions of rows, when ingesting log data for ad-hoc analysis, or when building a fast embedded query engine for a desktop/CLI tool.
category: data-engineering
version: 0.1.0
tags: [duckdb, analytics, parquet, csv, sql]
recommended_npm: ["@duckdb/node-api", "apache-arrow"]
license: MIT
author: claude-code-skills
---

DuckDB is "SQLite for analytics" — embedded, zero-config, columnar. From Node it gives you Postgres-grade SQL, vectorized execution, and direct readers for Parquet, CSV, and JSON without a separate process.

## Setup

```bash
pnpm add @duckdb/node-api
```

```ts
import { DuckDBInstance } from "@duckdb/node-api";

const instance = await DuckDBInstance.create(":memory:");   // or "./analytics.db"
const conn = await instance.connect();
```

## Read CSV / Parquet / JSON directly

DuckDB reads files as if they were tables — no import step:

```ts
const result = await conn.run(`
  SELECT
    date_trunc('day', created_at) AS day,
    count(*) AS events,
    avg(latency_ms) AS p_avg,
    quantile_cont(latency_ms, 0.95) AS p95
  FROM 'logs/*.parquet'
  WHERE created_at >= '2026-04-01'
  GROUP BY 1
  ORDER BY 1
`);

for await (const row of result.getRowsJS()) {
  // { day: Date, events: bigint, p_avg: number, p95: number }
}
```

`'logs/*.parquet'` is a glob — DuckDB reads files in parallel.

## Streaming results (large outputs)

```ts
const reader = await conn.runAndReadAll(`SELECT * FROM 'big.parquet'`);
for (const chunk of reader.getChunks()) {
  // chunk is a vectorized batch — process as Arrow record batches
}
```

For huge results, prefer `streamAndRead()` with backpressure handling.

## Joining files of different formats

```sql
SELECT
  e.user_id,
  e.event_name,
  u.country
FROM 'events.parquet' e
JOIN 's3://bucket/users.csv' u USING (user_id)   -- yes, S3 directly
WHERE e.event_name = 'purchase'
LIMIT 1000;
```

Set credentials once: `SET s3_access_key_id='...'; SET s3_secret_access_key='...';`.

## Persistent analytics DB

```ts
const instance = await DuckDBInstance.create("./warehouse.duckdb");
// CREATE TABLE / INSERT as usual
// Survives across runs; binary file format is portable.
```

DuckDB's storage format is now stable (v1+); safe to commit small fixture databases.

## Anti-patterns

- ❌ Loading 10M rows into JS arrays via `result.getRowsJS()` — defeats the columnar advantage. Aggregate in SQL, return summaries.
- ❌ Re-reading the same Parquet file in a loop — load it as a table once, query that.
- ❌ Using DuckDB as a transactional database (high write concurrency) — it's analytical (OLAP), not OLTP. Use Postgres for that.
- ❌ Skipping `LIMIT` during exploration — easy to OOM your shell with a 50M-row print.
- ❌ Running DuckDB in the same process as a high-QPS HTTP server without isolation — long queries block the event loop. Spawn a worker thread or job runner.
- ❌ Using `string` columns where `enum` would do — DuckDB stores enums as integers (much faster joins).

## Quality gates

- Queries scanning < 100M rows complete in < 1s on a laptop.
- Memory usage stays bounded (set `SET memory_limit='4GB'` and watch).
- Parquet outputs use compression (`COPY (...) TO 'out.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)`).
- Schema explicit (`CREATE TABLE` with types) — avoids inference drift between runs.
