You Don't Need a Data Warehouse: DuckDB Is Enough

11 min read

Let me guess: you have data in CSV or Parquet in S3, or maybe in a local folder. And someone in your company has already suggested that you need a “modern data warehouse” like Snowflake or BigQuery. After all, “we need to scale,” “we need performance,” “we need an enterprise solution.”

And with that comes: VPC configuration, credential management, IAM policies, cost control, query optimization, table partitioning, and a bill that can easily reach thousands of dollars per month.

Here’s an uncomfortable truth: in most cases, you don’t need any of that.

DuckDB can analyze your data directly — wherever it lives — without you moving a single byte. No server. No infrastructure. No compute cost. And probably faster than many “enterprise” solutions.

Let me show you why.

What Is DuckDB?

DuckDB is an embedded analytical database (OLAP). Think of it as “SQLite for analytics.”

Key features:

  • Zero configuration: No server to install or manage
  • In-process: Runs directly inside your Python, R, Node.js, etc. application
  • OLAP optimized: Columnar vectorized engine for extremely fast analytical queries
  • Direct file queries: CSV, Parquet, JSON — no need to import first
  • Remote queries: S3, HTTP(S), GCS — reads directly from the cloud
  • Native integration: Direct queries on Pandas DataFrames, Arrow, Polars
  • Free and open source: MIT License, no vendor lock-in

The fundamental difference: DuckDB brings the processing to the data, not the data to the processing.

Why You (Probably) Don’t Need a Data Warehouse

1. You’re Not at the Scale You Think You Are

Let’s be honest: most companies don’t have “big data.” They have “medium data” or even “small data that grew a bit.”

DuckDB can:

  • Process terabytes of data on a common laptop
  • Read gigabytes of compressed Parquet in seconds
  • Do joins and aggregations on billions of rows
  • Run on 500MB of RAM or on 500GB — it adapts

Unless you have tens of terabytes being queried simultaneously by hundreds of users, you probably don’t need a data warehouse.

2. You’re Paying to Move Data That’s Already Where It Needs to Be

Traditional flow:

  1. Data generated → S3/Cloud Storage
  2. ETL: Move data from S3 to Snowflake/BigQuery (💰 compute cost)
  3. Store duplicate data in the warehouse (💰 storage cost)
  4. Run queries (💰 compute cost again)

DuckDB flow:

  1. Data generated → S3/Cloud Storage
  2. Query directly on S3 (✅ zero extra cost)

Practical example:

import duckdb

# Query directly on Parquet in S3 — no ETL, no import
result = duckdb.sql("""
    SELECT 
        product,
        SUM(amount) as total_sales,
        COUNT(*) as num_sales
    FROM read_parquet('s3://my-bucket/sales/*.parquet')
    WHERE date >= '2024-01-01'
    GROUP BY product
    ORDER BY total_sales DESC
    LIMIT 10
""").df()

print(result)

That’s it. No cluster configuration, no data import, no warehouse bill.

3. Operational Complexity Kills Productivity

With a traditional data warehouse:

  • ⏰ Days or weeks to provision and configure
  • 🔐 VPC, networking, security group configuration
  • 💳 Cost and budget management
  • 📊 Performance and query monitoring
  • 🔄 ETL pipeline maintenance
  • 👥 Team training on a new platform
  • 🐛 Debugging network, permissions, and quota limit issues

With DuckDB:

pip install duckdb

Done. Does your team know SQL? Then they already know how to use DuckDB.

Real Use Cases: DuckDB in Practice

Case 1: System Log Analysis

You have JSON logs in S3 and want to analyze error patterns:

import duckdb

# Install HTTPFS extension (once)
duckdb.sql("INSTALL httpfs")
duckdb.sql("LOAD httpfs")

# Configure S3 credentials
duckdb.sql("""
    SET s3_region='us-east-1';
    SET s3_access_key_id='<your-key>';
    SET s3_secret_access_key='<your-secret>';
""")

# Direct analysis on compressed JSON in S3
query = """
SELECT 
    DATE_TRUNC('hour', timestamp) as hour,
    level,
    COUNT(*) as total_events,
    COUNT(DISTINCT user_id) as affected_users
FROM read_json_auto('s3://production-logs/app-logs-*.json.gz')
WHERE level IN ('ERROR', 'CRITICAL')
  AND timestamp >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE_TRUNC('hour', timestamp), level
ORDER BY hour DESC, total_events DESC
"""

result = duckdb.sql(query).df()

Performance: DuckDB reads only the necessary columns and applies filters during reading (predicate pushdown).

Case 2: Join from Multiple Sources

You have sales in Parquet, customers in CSV, products in a Pandas dataframe:

import duckdb
import pandas as pd

# Products already loaded in memory
products_df = pd.read_csv('products.csv')

# Mixed query: S3 + local + DataFrame
query = """
SELECT 
    s.date,
    p.category,
    p.product_name,
    COUNT(*) as num_sales,
    SUM(s.amount) as total_revenue,
    AVG(s.quantity) as avg_quantity
FROM read_parquet('s3://data/sales/2024/*.parquet') s
JOIN read_csv('customers.csv') c ON s.customer_id = c.id
JOIN products_df p ON s.product_id = p.id
WHERE c.segment = 'premium'
  AND p.category IN ('electronics', 'appliances')
GROUP BY s.date, p.category, p.product_name
ORDER BY total_revenue DESC
"""

result = duckdb.sql(query).df()

Note: DuckDB joins an S3 file, a local file, and a Pandas DataFrame without importing anything first.

Case 3: Fast Exploratory Data Analysis (EDA)

import duckdb

# Automatic statistical summary of large files
duckdb.sql("SUMMARIZE FROM read_parquet('sales-2024.parquet')").show()

# Output:
# ┌─────────────┬──────────────┬─────────┬──────────┬──────────┐
# │ column_name │ column_type  │   min   │   max    │   avg    │
# ├─────────────┼──────────────┼─────────┼──────────┼──────────┤
# │ date        │ DATE         │ 2024... │ 2024...  │   NULL   │
# │ amount      │ DECIMAL(10,2)│ 1.50    │ 9999.99  │  342.15  │
# │ quantity    │ INTEGER      │ 1       │ 500      │   12.3   │
# └─────────────┴──────────────┴─────────┴──────────┴──────────┘

Case 4: Infrastructure-Free Analytics Pipeline

import duckdb
from datetime import datetime

# Create persistent connection
con = duckdb.connect('analytics.duckdb')

# Process data and materialize important results
con.sql("""
    CREATE TABLE daily_sales AS
    SELECT 
        DATE_TRUNC('day', date) as date,
        category,
        SUM(amount) as total_sales,
        COUNT(*) as num_transactions,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM read_parquet('s3://raw-data/sales/*.parquet')
    WHERE date >= CURRENT_DATE - INTERVAL '90 days'
    GROUP BY DATE_TRUNC('day', date), category
""")

# Export to partitioned Parquet for later analyses
con.sql("""
    COPY daily_sales 
    TO 'processed_sales' 
    (FORMAT PARQUET, PARTITION_BY (date))
""")

# Quick query on processed data
result = con.sql("""
    SELECT * FROM daily_sales 
    WHERE date >= CURRENT_DATE - INTERVAL '7 days'
    ORDER BY total_sales DESC
""").df()

con.close()

DuckDB vs. Data Warehouses: Where DuckDB Wins

| Criterion | DuckDB | Snowflake/BigQuery/Redshift |
|---|---|---|
| Initial setup | `pip install duckdb` (5 seconds) | Days/weeks of configuration |
| Monthly cost | $0 | $500 - $50,000+ |
| Ad-hoc query latency | Milliseconds | Seconds (cold start) |
| Query on local files | ✅ Native | ❌ Needs upload |
| Query on S3/GCS | ✅ Direct | ⚠️ Needs import/external tables |
| Python/Pandas integration | ✅ Zero-copy | ⚠️ Needs export/import |
| Works offline | ✅ Yes | ❌ No |
| Vendor lock-in | ✅ Zero | ❌ High |
| Learning curve | ✅ Standard SQL | ⚠️ SQL + platform-specific |

DuckDB vs. Data Warehouses: Where Data Warehouses Win

Let’s be fair. There are cases where you really need a data warehouse:

1. Massive Concurrency

  • Hundreds of users running complex queries simultaneously
  • DuckDB is single-process; it doesn’t scale horizontally

2. High-Frequency Mutable Data

  • Thousands of UPDATEs/DELETEs per second
  • DuckDB is optimized for OLAP (read-heavy), not OLTP

3. Complex Enterprise Governance

  • Row-level security with thousands of rules
  • Detailed native audit logs
  • DuckDB has basic security features

4. Integration with Specific Ecosystem

  • If you already have your entire stack on AWS and need native integration with Redshift Spectrum, QuickSight, etc.

5. Truly Massive Data with Simultaneous Queries

  • Hundreds of terabytes being queried by dozens of analysts at the same time
  • DuckDB processes terabytes, but is single-machine

The honest question: Are you really at that scale? Or are you anticipating a problem you may never have?

Performance: Is DuckDB Really Fast?

Don’t just take their word for it. See the official benchmarks:

TPC-H SF100 (100GB) — Typical data warehouse query:

  • DuckDB: ~10 seconds (modern laptop)
  • Traditional systems: minutes

ClickBench (100M rows) — Real analytical queries:

  • DuckDB: Among the 3 fastest alongside ClickHouse

Why is it so fast?

  1. Vectorized execution: Processes thousands of values per operation
  2. Columnar storage: Reads only necessary columns
  3. Parallel processing: Uses all CPU cores
  4. Smart compression: Less I/O
  5. Zero network: In-process eliminates serialization

Modern Architecture: DuckDB + Data Lake

Modern architecture doesn’t involve a centralized data warehouse. It involves:

Data Lake (S3/GCS) → DuckDB for analytics

┌──────────────────────────────────────────┐
│         Object Storage (S3/GCS)          │
│  ┌──────────┐  ┌──────────┐  ┌─────────┐ │
│  │ Raw Data │  │ Curated  │  │ Serving │ │
│  │ (JSON)   │  │ (Parquet)│  │(Parquet)│ |
│  └──────────┘  └──────────┘  └─────────┘ │
└──────────────────────────────────────────┘
            ↓ Direct query
┌──────────────────────────────────────────┐
│              DuckDB (local)              │
│  • Python notebooks                      │
│  • APIs / Microservices                  │
│  • Batch jobs                            │
│  • BI tools (via ODBC)                   │
└──────────────────────────────────────────┘

Benefits:

  • ✅ Cheap storage (S3 = ~$0.023/GB/month)
  • ✅ Open format (Parquet)
  • ✅ Zero vendor lock-in
  • ✅ On-demand compute (only runs when needed)
  • ✅ Versioning with Delta Lake / Iceberg

Hybrid Pattern: When to Use Both

For some scenarios, it makes sense to use DuckDB + DW:

DuckDB for:

  • Local development and testing
  • ETL/ELT processing
  • Ad-hoc analysis
  • Rapid prototyping
  • Initial data exploration

Data Warehouse for:

  • Production dashboards with high concurrency
  • Queries for end users (non-technical)
  • Data requiring frequent updates

Example workflow:

# 1. Develop and test with local DuckDB
import duckdb

query = """
SELECT 
    category,
    DATE_TRUNC('month', date) as month,
    SUM(amount) as total
FROM read_parquet('s3://raw-data/*.parquet')
GROUP BY category, month
"""

# Test locally on sample
test_result = duckdb.sql(query + " LIMIT 1000").df()

# 2. When validated, materialize in warehouse for production
duckdb.sql(f"""
    COPY ({query}) 
    TO 's3://processed-data/monthly_sales.parquet'
    (FORMAT PARQUET)
""")

# 3. Warehouse loads the processed Parquet
# Much cheaper than processing in the warehouse directly

Powerful Integrations

Query PostgreSQL/MySQL Without Migration

import duckdb

# Attach existing database
duckdb.sql("INSTALL postgres")
duckdb.sql("LOAD postgres")

duckdb.sql("""
    ATTACH 'host=localhost user=user password=pass dbname=production' 
    AS pg (TYPE POSTGRES)
""")

# Join between PostgreSQL and S3 Parquet
result = duckdb.sql("""
    SELECT 
        p.name,
        s.total_sales
    FROM pg.products p
    JOIN read_parquet('s3://analytics/sales.parquet') s
        ON p.id = s.product_id
    ORDER BY s.total_sales DESC
    LIMIT 10
""").df()

Query Excel (Yes, Excel!)

duckdb.sql("INSTALL spatial")
duckdb.sql("LOAD spatial")

# Read Excel directly
result = duckdb.sql("""
    SELECT * 
    FROM st_read('sales_report.xlsx', layer='Sales')
    WHERE amount > 1000
""").df()
Full-Text Search

# Text search on data (requires the FTS extension)
duckdb.sql("INSTALL fts")
duckdb.sql("LOAD fts")

duckdb.sql("""
    CREATE TABLE documents AS 
    SELECT * FROM read_parquet('documents.parquet')
""")

duckdb.sql("PRAGMA create_fts_index('documents', 'id', 'content')")

results = duckdb.sql("""
    SELECT id, content, fts_main_documents.match_bm25(id, 'duckdb analytics') as score
    FROM documents
    WHERE score IS NOT NULL
    ORDER BY score DESC
""").df()

How to Get Started with DuckDB Today

1. Installation

# Python
pip install duckdb

# R
install.packages("duckdb")

# Node.js
npm install duckdb

# Standalone CLI
curl -LO https://github.com/duckdb/duckdb/releases/download/v1.5.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
./duckdb

2. First Script

import duckdb

# Quick analysis of a large CSV
duckdb.sql("""
    SELECT 
        category,
        AVG(price) as avg_price,
        COUNT(*) as total_products
    FROM read_csv('products.csv', AUTO_DETECT=TRUE)
    GROUP BY category
    ORDER BY total_products DESC
""").show()

3. Persist Results

# Create persistent database
con = duckdb.connect('my_analytics.duckdb')

# Materialize table
con.sql("""
    CREATE TABLE sales AS
    SELECT * FROM read_parquet('s3://data/sales/*.parquet')
""")

# Fast queries later
result = con.sql("SELECT * FROM sales WHERE date > '2024-01-01'").df()

con.close()

4. Jupyter Integration

# In any Jupyter notebook, via the jupysql extension (pip install jupysql)
%load_ext sql
%sql duckdb://

# Now you can run SQL directly in a cell
%sql SELECT * FROM read_parquet('data.parquet') LIMIT 10

Myths and Truths About DuckDB

❌ Myth: “DuckDB is only for small data”

✅ Truth: DuckDB efficiently processes terabytes. The limitation is hardware, not software.

❌ Myth: “No server means no scalability”

✅ Truth: Scales vertically very well. A modern server with 128GB RAM processes A LOT.

❌ Myth: “Can’t use it in production”

✅ Truth: Companies like GitHub, Hugging Face, and others use it in production.

❌ Myth: “Embedded SQL is slow”

✅ Truth: DuckDB frequently outperforms distributed systems for single-machine workloads.

❌ Myth: “It doesn’t have enterprise features”

✅ Truth: It has ACID transactions, MVCC, secondary indexes, full-text search, window functions, etc.

When NOT to Use DuckDB

Be clear about the limitations:

  1. You need high-frequency write concurrency → Use PostgreSQL
  2. Hundreds of users making queries simultaneously → Use a data warehouse
  3. You need a multi-node distributed system → Use Spark, ClickHouse, or data warehouse
  4. Compliance requires detailed native auditing → Use enterprise solutions
  5. Your team already masters and is satisfied with the current solution → Don’t change just to change

Conclusion: Question Complexity

The tech industry loves complexity. Vendors profit by selling solutions to problems you don’t have. Consultants earn money implementing over-engineered architectures.

Ask the honest question:

“Do I really need this, or am I just following what ‘everyone does’?”

For most companies:

  • You don’t have petabytes of data
  • You don’t have thousands of concurrent users on analytics
  • You don’t need sub-second latency on terabyte aggregations
  • You do need agility, low cost, and simplicity

DuckDB offers that.

Start simple. Query your Parquets in S3 with DuckDB. If and when you really grow to the scale that needs Snowflake (and most never get there), migrating is easy — your data is already in open format.

But I bet you’ll find that DuckDB is enough. And enough is underrated.



Next step: Install DuckDB now and run a query on a file you already have. It will literally take 2 minutes. Then tell me if you still think you need a data warehouse.