ClickHouse: The Fastest Database in the World You Probably Don't Know

Dealing with problems analyzing large volumes of data is no novelty for anyone today. But many people don’t know that solutions already exist for this problem. One of them is the ClickHouse database. In this article we’ll look at the two main categories of existing databases and also go deeper into how ClickHouse works.

As mentioned earlier, today there are two major classifications for databases:

OLTP

Online Transaction Processing.

Normalized Data (distributed across entities/tables);
Perfect for mutable data (frequent Updates and Deletes);
Transactional.

OLAP

Online Analytical Processing.

Denormalized Data (all information resides in the same entity/table);
Perfect for immutable data;
Analytical;
Superior performance in large aggregations, e.g.: generating daily, monthly, annual charts and indicators.

Another detail — not a rule but very common in OLAP databases — is the absence (or inefficiency) of data modification features for data already in the database (the famous UPDATE and DELETE).

ClickHouse does have these features, but they are asynchronous operations. According to them:

Although ClickHouse is aimed at high-volume analytical workloads, it is possible, in some situations, to modify or delete existing data. These operations are labeled “mutations” and are executed using the ALTER TABLE command.

Before going deeper into how ClickHouse works, let’s take a quick look at how it came to be.

History

2008-2012

It was born for the Yandex.Metrica project, today the 3rd largest web analytics platform in the world, trailing only Google Analytics and Facebook Pixel — hence the name ClickHouse, “house of clicks”. Yandex is the most used search engine in Russia.

2014

Yandex.Metrica was responsible for a data flow of 20 billion events daily. Storing more than 20.3 trillion rows in 2 PB of data (in CSV that’s equivalent to 17 PB).

2016

In 2016, the ClickHouse project was released as open source software under the Apache 2 license.

Why ClickHouse?

Now that we understand how it came about and some of its peculiarities, let’s go deeper and understand why you should choose it over other databases in the same category.

Performance

ClickHouse offers the best query performance in the industry while significantly reducing storage requirements through the use of columnar storage and data compression.

Below we can see some benchmark tests comparing ClickHouse with MySQL and PostgreSQL.

You can see the complete benchmark at this link.

What gives it absurd performance (in addition to significantly reducing storage space) is that ClickHouse is a columnar database. We can see below how much this makes a difference in query times.

ROW-oriented database (PostgreSQL, MySQL, SQLServer…)

COLUMN-oriented database (ClickHouse)

I don’t know about you, but benchmark tests to me are like politician ads — they promise a lot, but I only believe what I see. So let’s look at a test comparing ClickHouse and an Oracle database in a real production scenario.

Below are some tests comparing them.

First, let’s understand our scenario: we’re working with electronic invoice data. Both tables in both databases have similar structures (same columns) and similar amounts of data (rows), as can be seen below.

And yes, the Oracle database is properly indexed and optimized. For curiosity, it is on a server with double the capacity (memory and processing) of the server running ClickHouse.

As we can see, the time difference in each query was absurd — 177ms for ClickHouse and 17s for Oracle.

Grouping by month also made virtually no difference in ClickHouse, with a query time of 345ms versus 18s in Oracle.

Another incredible feature is the native integration ClickHouse has with AWS S3, allowing you to save query results to files directly in S3, as well as read those files later in a very simple manner. As you can see in the image above, it took only about 6 seconds to generate a CSV file with hundreds of thousands of rows, while on Oracle I had to cancel the query after waiting more than 2 hours for the database to fetch the data.

Scalability

Used in production by companies like Uber, eBay, Cloudflare, Spotify, Deutsche Bank (Germany’s largest bank), etc. With linear horizontal and vertical scalability. From single-server deployments to clusters with many thousands of nodes.

Reliability

ClickHouse presents best-in-class availability. There are no single points of failure, with the architecture supporting multi-master replication. With effective performance in multi-region configurations.

Security

ClickHouse comes with enterprise-level security features and failsafe mechanisms that protect against data corruption, application bugs, and human errors. All performed actions are logged.

For more details I recommend checking the official ClickHouse documentation.

Leave a comment below! Do you believe ClickHouse has everything it takes to be the database of the modern data stack, or are there better alternatives?