I Tried chDB for Analytics — Here’s What I Learned

Mukilyavimali
Jan 14
5 min read

Hello everyone!

Welcome to another Interesting Tech Update on Analytics!

Recently, I came across an interesting update about chDB in the TDLR Newsletter, which motivated me to explore it further. Until now, I’ve mostly worked with server‑based SQL engines such as PostgreSQL, MySQL, and SQL Server. I’ve also used serverless SQL engines on many challenge platforms like DataLemur, LeetCode, and HackerRank.

These platforms typically use in‑process SQL engines, which don’t require any installation, server setup, network connection, or configuration. They are lightweight, fast, and easy to embed directly in a browser which makes them perfect for learning and practicing.

What is chDB?

chDB is a super‑fast, in‑process SQL engine for Python that runs the ClickHouse database engine directly inside the Python program without installing any server.

It gives you the power of ClickHouse (one of the fastest analytical databases in the world) but in a lightweight, serverless form, like DuckDB.

Now Let’s have a look at the chDB Features and History and little bit of ClickHouse database engine.

chDB Feature:

1. chDB is an In-Process SQL OLAP engine.

“In‑process” means the SQL engine runs inside your Python program, not as a separate server.
“OLAP engine” means it is optimized for analytics, not transactions.
It uses the ClickHouse engine, which is one of the fastest analytical databases in the world.

2. No Need to install ClickHouse

We don’t need to install the ClickHouse Server instead we need to only install the Python Library, and we can run the SQL Query immediately.

3. ClickHouse uses Memory view from Python

Means ClickHouse is written in C++ and Pyhton is slower so copying data between them is expensive.
So chDB uses Python’s memory view which reduces the copying overhead and increases the query speed.

4. chDB supports almost 60+ data formats like CSV, JSON, PARQUET etc.,

we can query any file directly without convert it into a Pandas Data frame for Querying purposes.

5. It supports Python DB API 2.0

This means chDB behaves like other Python database libraries such as:

psycopg2 (PostgreSQL)
Sqlite3
Mysql‑connector

6. chDB supports the following biding languages like Python, Go, Rust, NodeJS, Bun, C and C++.

Architecture Diagram of chDB:

Basically, it is a three-layer architecture with

CoreLayer- ClickHouse Engine
Bridge Layer- Python Binding Layer
Inout / Output Layer to handle the different file formats

Below is the architecuture diagram of chDB.

Image Source: https://pypi.org/project/chdb/

chdb was introduced in 2023 and upgraded till now with Performance, Memory handling, SQL Capability and File format support.

Installation in Notebooks:

Currently, chdb supports Python 3.8 to 3.12 on macOS and Linux (x86_64 and ARM64).

Use the below Command for installation:

Pip install chdb

Then we can import the python library

Import chdb

How to Query the csv file using chdb:

Chdb is installed and imported as below,

Now we can query the csv that we have locally in the machine. To query the syntax is,

“import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import chdb”

we need to import all the libraries we need for our analytics.

To query the csv file which is locally present in the collab content:

sale = chdb.query(""

SELECT *

FROM file('/content/sales_data.csv', 'CSVWithNames')

""", "dataframe")

You should have the csv file that you want to use it in the collab content locally if you are using Google Collab. It differs if you use different Python platform or IDE’s. Here I have used Google Collab.

The sale data is now in the data frame called sale. Same way we can query multiple format files, backup files, etc., using chdb without need to install database server engine. And the query would be much faster than the latter way.

To display the sale data frame:

print(sale)

Same way we can get the region wise sales using chDB and visually display it quickly for the EDA Purpose using SQL query in Python.

To get the region wise total sales:

regionsale = chdb.query("""

SELECT

region,

SUM(quantity * unit_price) AS total_sales

FROM file('/content/sales_data.csv', 'CSVWithNames')

GROUP BY region

ORDER BY total_sales DESC

""", "dataframe")

# To Print the region wise sales

print(regionsale)

The output would be,

To display visually,

plt.bar(regionsale['region'], regionsale['total_sales'], color='skyblue')

for i, v in enumerate(regionsale['total_sales']):

plt.text(i, v, f"{v:.1f}", ha='center', va='bottom')

plt.title("Region-wise Total Sales")

plt.xlabel("Region")

plt.ylabel("Total Sales ($)")

plt.grid(axis='y', linestyle='--', alpha=0.6)

plt.show()

The output would be,

This is how we can do the analytical task using the chDB library easily. This will undoubtedly improve your performance time faster than you image.

Chdb Limitations:

Even though chDB supports 60+ formats, its power comes from focusing only on structured, analytics‑friendly files. This is both a capability and a limitation for chdb.

Chdb supports flat files like csv, tsv, delimited files.
It supports semi structured files like JSON, NDJSON.
It supports binary files such as parquet.
It can handle compressed files even like zip files.
It can handle remote sources like URL’s, S3 bucket and even a multiple file folder.
But it cannot handle dump files, back up files and any kind of password protected files.

Advantages of chdb:

We can instantly SQL Query on the supporting files without the SQL Infrastructure.
There is Zero setup and Zero server is needed.
The ClickHouse technology is giving us a high-speed aggregation and joins on /large datasets.
It is perfect for notebooks like Jupyter and Google Collab when you want to see a quick insight.
It is lightweight and In-Process and portable, there is no admin permission or any other permission issues.
It is a great fit for ETL and preprocessing which filter, join, aggregate and clean data before feeding it into ML or dashboard.
It is Read-only by design, so data integrity is 100 percent well maintained.

Where Does chDB Fit in the analytics Ecosystem?

When I started exploring with chDB, the one big question in my mind was:

“Where does it exactly fit among the tools like Pandas, DuckDB, SQLite and ClickHouse?”

Now, after trying out the ream analytics workflow my understanding is:

Pandas is the default library for data work in Python. It is great for loading, cleaning and transforming data in memory, but chDB is sitting beside it is not replacing pandas instead it runs heavy SQL aggregations and joins on top of the data frames or files even faster.
DuckDB is a general-purpose embedded OLAP database for small and medium workloads and integration whereas chDB is a ClickHouse-grade speed and SQL but you don’t need to manage a Click House.
ClickHouse is a server-based OLAP database built for huge, production-grade workloads and real time analytics whereas chDB has no server, no cluster and no network. It is like a gateway to the ClickHouse, we need to install the library and run the SQL inside Python.
Traditional SQL Databases are OLTP workloads but chDB is meant for Analytical workloads, no CRUD operations supported here.

Conclusion:

Spending hardly few minutes with chDB, one thing was obvious: modern analytics doesn’t have to mean heavy servers, complicated installs, or big database setups anymore. Tools like chDB show just how far in-process SQL engines have come and give you ClickHouse-level speed, Python simplicity, and the ability to query files directly, without any infrastructure hassle.

chDB doesn’t try to replace PostgreSQL, MySQL, or a full ClickHouse server. Instead, it fills a gap like fast, lightweight, file-based analytics that works right inside notebooks and Python workflows.

It’s great for quick EDA, teaching SQL, prototyping ideas, or running small local ETL jobs. For analysts, students, and data scientists who just want instant SQL power without the setup pain, chDB feels like a breath of fresh air. You can focus on the data and ideas not the infrastructure.

Happy Learning!!!

References: TDLR NewsLetter, https://pypi.org/project/chdb/1.4.1/

Welcome
to NumpyNinja Blogs

I Tried chDB for Analytics — Here’s What I Learned

Hello everyone!

Welcome to another Interesting Tech Update on Analytics!

What is chDB?

chDB Feature:

Architecture Diagram of chDB:

Installation in Notebooks:

How to Query the csv file using chdb:

Chdb Limitations:

Advantages of chdb:

Where Does chDB Fit in the analytics Ecosystem?

Conclusion:

Recent Posts

Welcome to NumpyNinja Blogs

Hello everyone!

Welcome to another Interesting Tech Update on Analytics!

What is chDB?

chDB Feature:

Architecture Diagram of chDB:

Installation in Notebooks:

How to Query the csv file using chdb:

Chdb Limitations:

Advantages of chdb:

Where Does chDB Fit in the analytics Ecosystem?

Conclusion:

Welcome
to NumpyNinja Blogs