Tools to analyse slow queries in ClickHouse

At Contentsquare, we rely a lot on ClickHouse. We use it as the main storage of all our analytics products. We also use it for deeper analysis done by our Data Analysis and Data Science teams. Therefore, a lot of people need to write ClickHouse queries. While ClickHouse is blazing fast, it can also be slow when used badly. In this article we will show you some tools that help us understand why a query is slow.

This article is a bit advanced and is meant for people already familiar with ClickHouse. Although some parts are linked, you can read each of them independently.

Setup

Note that this article was written with ClickHouse v21.9.2 in mind, but most features have been available since v20. The common thread of all the tools we will see is the analysis of the following query:

SELECT count()
FROM table1
WHERE rand_val IN
(
    SELECT rand_val
    FROM table2
    WHERE (rand_val != 50000)
) AND (rand_val != 4000)

This query contains a subquery and a few filters. Both tables contain 30 million rows.

If you want to reproduce what is done in the article, here are the commands that create the data needed for the query:

CREATE TABLE table1
ENGINE = MergeTree()
ORDER BY (id1, id2)
AS SELECT
    number % 10 AS id1,
    number AS id2,
    rand64(number) AS rand_val
FROM numbers(30000000);

CREATE TABLE table2
ENGINE = MergeTree()
ORDER BY (id1, id2)
AS SELECT
    number % 10 AS id1,
    number % 15 AS id2,
    rand64(number) AS rand_val
FROM numbers(30000000);

This will create 2 tables, each containing 30 million rows.
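
If you want to double-check the setup before benchmarking, a quick sanity query should report 30 million rows in each table:

SELECT
    (SELECT count() FROM table1) AS table1_rows,
    (SELECT count() FROM table2) AS table2_rows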

Queries can be slow for a lot of external reasons (network congestion, limited disk space, high CPU usage, etc.). To reliably assert that a query is slow, you should run it several times. Moreover, you should do this test on an isolated system so that other workloads can’t bias the results.

Benchmarking queries

ClickHouse provides clickhouse-benchmark, a utility to run a query multiple times and get some statistics. It allows you to:

  • Run one (or many) queries multiple times and get percentiles of the speed
  • Run queries concurrently
  • Put restrictions on queries (for example: max 1 GB of memory used, or only one thread processing the queries; see the example below)

For more information, look at the official ClickHouse documentation.
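
As an illustration, here is a sketch combining those options. It assumes your version of clickhouse-benchmark forwards session settings such as max_threads and max_memory_usage as command-line flags (recent versions do; check clickhouse-benchmark --help otherwise):

Terminal window
$ echo "SELECT count() FROM table1" | clickhouse-benchmark -i 20 -c 2 --max_threads=1 --max_memory_usage=1000000000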

In this example we will use clickhouse-benchmark to run the slow query 20 times and get the median time.

Terminal window
$ echo "SELECT count() FROM table1 WHERE rand_val IN ( SELECT rand_val FROM table2 WHERE (rand_val != 50000)) AND (rand_val != 4000)" | clickhouse-benchmark -i 20
Queries executed: 20.
localhost:9000, queries 20, QPS: 0.346, RPS: 20764323.787, MiB/s: 158.419, result RPS: 0.346, result MiB/s: 0.000.
0.000% 2.833 sec.
10.000% 2.848 sec.
20.000% 2.859 sec.
30.000% 2.861 sec.
40.000% 2.868 sec.
50.000% 2.894 sec.
60.000% 2.898 sec.
70.000% 2.909 sec.
80.000% 2.922 sec.
90.000% 2.935 sec.
95.000% 2.953 sec.
99.000% 2.972 sec.
99.900% 2.972 sec.
99.990% 2.972 sec.

We can see that the median time of the query is 2.9 seconds and that the speed is quite stable.

Query plan

Our previous query took 2.9 seconds to process 30 million rows on 4 threads. We consider that slow. Why is that?

We first need to understand how the query is processed by ClickHouse. This is where the EXPLAIN (or EXPLAIN PLAN) keyword comes in handy. With this keyword, you can ask ClickHouse to generate a query plan that describes all the logical operators that need to be applied to the data to process the query.

Let’s ask ClickHouse to generate this plan:

EXPLAIN
SELECT count()
FROM table1
WHERE (rand_val IN (
    SELECT rand_val
    FROM table2
    WHERE rand_val != 50000
)) AND (rand_val != 4000)

And here comes the answer from ClickHouse:

┌─explain───────────────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))                                        │
│   CreatingSets (Create sets before main query execution)                           │
│     Aggregating                                                                    │
│       Expression (Before GROUP BY)                                                 │
│         Filter (WHERE)                                                             │
│           SettingQuotaAndLimits (Set limits and quota after reading from storage)  │
│             ReadFromMergeTree                                                      │
│     CreatingSet (Create set for subquery)                                          │
│       Expression ((Projection + Before ORDER BY))                                  │
│         Filter (WHERE)                                                             │
│           SettingQuotaAndLimits (Set limits and quota after reading from storage)  │
│             ReadFromMergeTree                                                      │
└────────────────────────────────────────────────────────────────────────────────────┘

This plan allows us to understand how ClickHouse processes the query. With experience you can guess what step of the plan is linked to what SQL operator. But if you don’t want to guess, you can ask ClickHouse to show a much more verbose plan with the following command:

EXPLAIN actions = 1, indexes = 1
SELECT count()
FROM table1
WHERE (rand_val IN (
    SELECT rand_val
    FROM table2
    WHERE rand_val != 50000
)) AND (rand_val != 4000)
┌─explain───────────────────────────────────────────────────────────────────────────┐
│ [...redacted for brevity...]                                                       │
│     CreatingSet (Create set for subquery)                                          │
│       Set: _subquery4                                                              │
│       Expression ((Projection + Before ORDER BY))                                  │
│         Actions: INPUT :: 0 -> rand_val UInt64 : 0                                 │
│         Positions: 0                                                               │
│         Filter (WHERE)                                                             │
│           Filter column: notEquals(rand_val, 50000) (removed)                      │
│           Actions: INPUT : 0 -> rand_val UInt64 : 0                                │
│                    COLUMN Const(UInt16) -> 50000 UInt16 : 1                        │
│                    FUNCTION notEquals(rand_val : 0, 50000 :: 1)                    │
│                      -> notEquals(rand_val, 50000) UInt8 : 2                       │
│           Positions: 0 2                                                           │
│           SettingQuotaAndLimits (Set limits and quota after reading from storage)  │
│             ReadFromMergeTree                                                      │
│             ReadType: Default                                                      │
│             Parts: 8                                                               │
│             Granules: 3663                                                         │
│             Indexes:                                                               │
│               PrimaryKey                                                           │
│                 Condition: true                                                    │
│                 Parts: 8/8                                                         │
│                 Granules: 3663/3663                                                │
└────────────────────────────────────────────────────────────────────────────────────┘
  • actions = 1 lets you see the corresponding actions (and therefore SQL expressions) done by each logical operator
  • indexes = 1 lets you see the number of blocks of data filtered by a WHERE clause on the primary key. In our case nothing is filtered (granules = 3663/3663 & parts = 8/8) since we didn’t use the primary key. See the example below for a query that does.
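
For comparison, here is what the indexes output looks like when the primary key can be used. This is a sketch: the exact counts depend on how the parts were written, but filtering on id1 (the first column of the ORDER BY key) should make the Granules ratio drop well below 3663/3663:

EXPLAIN indexes = 1
SELECT count()
FROM table1
WHERE id1 = 3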

This detailed view allows us to see that our subquery requires the creation of a Set (the CreatingSet (Create set for subquery) part). This Set contains the result of our subquery: in our case, all the rand_val values that match the filter condition (the Filter (WHERE) part).

The creation of a Set is an expensive operation in most programming languages. A Set in ClickHouse is actually a HashSet, which means that when we ask it to store 30 million elements, it has to compute 30 million hashes. This is important for later.
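
You can get a feeling for this cost in isolation with the numbers table function. This is only an illustrative sketch, not a rigorous benchmark: the IN clause below forces ClickHouse to hash 30 million values into a Set before it can probe a single row.

-- Most of the elapsed time of this query goes into building the Set,
-- even though both sides are cheap in-memory number generators.
SELECT count()
FROM numbers(30000000)
WHERE number IN (SELECT number FROM numbers(30000000))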

Execution plan

The query plan gave us an idea of the main logical operators in the query. Let’s see how the query will be executed by ClickHouse using EXPLAIN PIPELINE.

EXPLAIN PIPELINE
SELECT count()
FROM table1
WHERE (rand_val IN (
    SELECT rand_val
    FROM table2
    WHERE rand_val != 50000
)) AND (rand_val != 4000)
┌─explain───────────────────────────────┐
│ (Expression)                          │
│ ExpressionTransform                   │
│   (CreatingSets)                      │
│   DelayedPorts 2 → 1                  │
│     (Aggregating)                     │
│     Resize 4 → 1                      │
│       AggregatingTransform × 4        │
│         StrictResize 4 → 4            │
│           (Expression)                │
│           ExpressionTransform × 4     │
│             (Filter)                  │
│             FilterTransform × 4       │
│               (SettingQuotaAndLimits) │
│                 (ReadFromMergeTree)   │
│                 MergeTreeThread × 4 0 → 1 │
│     (CreatingSet)                     │
│     CreatingSetsTransform             │
│       Resize 4 → 1                    │
│         (Expression)                  │
│         ExpressionTransform × 4       │
│           (Filter)                    │
│           FilterTransform × 4         │
│             (SettingQuotaAndLimits)   │
│               (ReadFromMergeTree)     │
│               MergeTreeThread × 4 0 → 1 │
└────────────────────────────────────────┘

This plan shows the real operators (called processors and located in the src/Processors folder of the ClickHouse codebase) that will execute the query, such as ExpressionTransform, Resize, or AggregatingTransform.
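
If you prefer a picture, EXPLAIN PIPELINE also accepts a graph = 1 setting that prints the pipeline in the Graphviz DOT language. A sketch of how you could render it, assuming graphviz is installed (the TSVRaw format strips the table framing so the output can be piped straight into dot):

Terminal window
$ clickhouse-client -q "EXPLAIN PIPELINE graph = 1 SELECT count() FROM table1 WHERE rand_val != 4000 FORMAT TSVRaw" | dot -Tpng -o pipeline.png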

Moreover, the plan shows us the number of threads for each operation. We can see that the subquery (under CreatingSet) starts by reading data from disk and filtering it using 4 threads. But the creation of the Set (CreatingSetsTransform) is done by only 1 thread. This is a major bottleneck, as the creation of a Set is slow by nature. It is therefore likely that the slowness comes from the Set creation needed by the subquery, as we suspected.
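
The thread counts shown in the pipeline follow the max_threads setting. As a sketch, re-running the EXPLAIN with max_threads = 8 should show the parallel operators with × 8, while CreatingSetsTransform stays single-threaded:

EXPLAIN PIPELINE
SELECT count()
FROM table1
WHERE (rand_val IN (
    SELECT rand_val
    FROM table2
    WHERE rand_val != 50000
)) AND (rand_val != 4000)
SETTINGS max_threads = 8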

Temporary verbose log

At this stage, the suspected cause of our slowness is the Set (needed for the subquery). Now we needed to confirm it by looking at the logs. We ran the query and asked ClickHouse to be as verbose as possible. Once connected with our clickhouse-client, we sent the following command to enable tracing (note that this might not work with a JDBC client).

SET send_logs_level = 'trace'

We re-ran our query and checked the logs:

Terminal window
[christophe] 2021.09.13 12:38:02.473359 [ 1569254 ] {99b0d8bb-02be-424d-8ca2-b542704ae556} <Debug> executeQuery: (from 127.0.0.1:36566) SELECT count() FROM table1 WHERE rand_val IN ( SELECT rand_val FROM table2 WHERE (rand_val != 50000) ) AND (rand_val != 4000)
{...redacted for brevity...}
[christophe] 2021.09.13 12:38:02.476892 [ 1569398 ] {99b0d8bb-02be-424d-8ca2-b542704ae556} <Trace> CreatingSetsTransform: Creating set.
[christophe] 2021.09.13 12:38:04.806816 [ 1569398 ] {99b0d8bb-02be-424d-8ca2-b542704ae556} <Debug> CreatingSetsTransform: Created Set with 30000000 entries from 30000000 rows in 2.329918365 sec.
{...redacted for brevity...}
1 rows in set. Elapsed: 3.004 sec. Processed 60.00 million rows, 480.00 MB (19.97 million rows/s., 159.78 MB/s.)

This is what we learned by carefully reading the logs:

  • time taken to load, decompress and deserialize the data from disk
  • how many threads were used for a specific operation
  • how much data was pruned due to a filter on the primary key

Unfortunately, the logs wouldn’t let us know how long the filter operations took (like SELECT ... WHERE field1='A' or field2='B') since they’re included in the aggregation operation.

Nevertheless, those logs showed that our query took 3 seconds. Almost 80% of that time (2.3 seconds) was spent on the Set creation! We now had proof of the bottleneck.

Query log & trace log

Instead of manually enabling debug logs while the cluster is running, we could have used standard ClickHouse core tables that contain this information.

Here are the 2 most useful tables to understand what is happening in ClickHouse.

Query log

Any ClickHouse installation has a system.query_log table enabled by default. Whenever a query starts or finishes, it is logged into that table along with its metadata (including its duration).

If you know approximately when you did your query, you can find it by filtering on the fields event_time and query.

For example:

SELECT *
FROM system.query_log
WHERE event_time >= '2021-08-25 13:38:41'
  AND event_time <= '2021-08-25 13:39:31'
  AND query LIKE 'select%table1%'
  AND type = 'QueryFinish'
FORMAT Vertical;

The interesting fields of the query_log table are:

  • query_duration_ms: our best indicator for slow queries
  • read_rows, read_bytes, result_bytes, memory_usage: to get an idea of the amount of data
  • query: the raw text of the query that we can use to filter on
  • Settings.Names & Settings.Values: if you want to reproduce the query with the settings used during the slow query (for example if the query was configured with the max_threads setting to use only 1 core instead of half of the server’s CPUs)
  • is_initial_query: when using a cluster, this boolean differentiates queries sent by a client from queries sent by another server of the cluster.
  • query_id: if you know exactly which query you’re looking for, because you ran it manually with clickhouse-client and got the id

This table will help you check your query history and see if similar queries were always slow or if it was a temporary glitch on a specific one.
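
For example, here is a sketch of a query that lists the 10 slowest finished queries of the day (the field names are those of v21.9; adjust for your version):

SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(memory_usage) AS memory,
    query
FROM system.query_log
WHERE (type = 'QueryFinish') AND (event_date = today())
ORDER BY query_duration_ms DESC
LIMIT 10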

Trace log

The trace_log table is another valuable source of information. It lets us profile the time spent in the actual C++ functions for any given query. While it requires a deeper understanding of the ClickHouse source code, it provides invaluable insights when benchmarking or testing new ClickHouse features.

Note that it is not enabled by default, as it has an impact on the overall performance. To activate it, you need to put the following XML block in the ClickHouse server configuration file (/etc/clickhouse-server/config.xml by default).

<trace_log>
    <database>system</database>
    <table>trace_log</table>
    <partition_by>toYYYYMM(event_date)</partition_by>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</trace_log>

You also need to install the clickhouse-common-static-dbg package in order to demangle the symbols in this table. Without it, you’ll only see a bunch of integers instead of function call stacks.

Terminal window
# for ubuntu/debian
sudo apt-get install clickhouse-common-static-dbg

Note that if you are not using Ubuntu, you can download and install the package from GitHub.

The interesting fields of the trace_log table are:

  • timestamp_ns: the timestamp in nanoseconds at the moment the sample was taken
  • trace_type: the metric the sample represents out of 4 types: CPU, Real, Memory, MemorySample
  • trace: the stack trace at the moment the sample was taken
  • query_id: the id of the query sampled

Once the trace_log is activated:

  • Connect to ClickHouse with clickhouse-client
  • Set query_profiler_real_time_period_ns and query_profiler_cpu_time_period_ns to 10000000 to ask ClickHouse to sample the C++ functions that are running. It will be done every 10 milliseconds.
  • Run the query you want to monitor
Terminal window
$ clickhouse-client
:) SET query_profiler_real_time_period_ns = 10000000
:) SET query_profiler_cpu_time_period_ns = 10000000
:) SELECT count() FROM table1 WHERE (rand_val IN (SELECT rand_val FROM table2 WHERE rand_val != 50000)) AND (rand_val != 4000)

The result from clickhouse-client provides you with the query id:

Terminal window
Query id: f435eee7-1518-4215-b872-19350cb8e6f7
┌─count()─┐
│       0 │
└─────────┘
1 rows in set. Elapsed: 4.682 sec. Processed 60.00 million rows, 480.00 MB (12.82 million rows/s., 102.53 MB/s.)

Note: the query that took 2.9 seconds in the previous parts now takes 4.7 seconds because of the overhead of populating the trace_log table.

Then:

  • run another connection to ClickHouse (don’t use the previous connection, in order to avoid adding noise to the trace_log)
  • allow ClickHouse to demangle
  • query the trace_log table to see the 2 functions that took the most time
  • specify the type of trace you’re looking for ('Real' = 0, 'CPU' = 1, 'Memory' = 2, 'MemorySample' = 3)
  • specify the query_id you’re looking for

In this example, we’re looking for the CPU trace. If a C++ function has 8 CPU traces, it means, more or less, that ClickHouse spent 8 * 10 milliseconds on it.

Terminal window
$ clickhouse-client
:) SET allow_introspection_functions = 1
:) SELECT count(), arrayStringConcat(arrayMap(x -> concat(demangle(addressToSymbol(x)), '\n ', addressToLine(x)), trace), '\n') AS sym FROM system.trace_log WHERE (query_id = 'f435eee7-1518-4215-b872-19350cb8e6f7') AND (event_date = today()) AND (trace_type = 'CPU') GROUP BY trace ORDER BY count() DESC LIMIT 2 FORMAT Vertical

The results are a bit verbose because the call stacks are very deep. Here is a summary of what we found:

Terminal window
count(): 195
sym: void DB::Set::executeImplCase<DB::SetMethodOneNumber<[big signature due to the high templatization in the CH codebase]
[big stack call]
/build/glibc-eX1tMB/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:97
count(): 137
sym: void DB::Set::insertFromBlockImplCase<DB::SetMethodOneNumber<[big signature due to the high templatization in the CH codebase]
/usr/bin/clickhouse
[big stack call]
/build/glibc-eX1tMB/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:97

We can see that:

  • 137 traces (so 137 * 10 milliseconds spent) are from Set::insertFromBlockImplCase which, as per the source code, corresponds to the creation of the Set
  • 195 traces (so 195 * 10 milliseconds spent) are from Set::executeImplCase, which is also linked to the Set

Overall, 3.32 seconds were spent on the Set out of the 4.7 seconds of the query. The Set represents 70% of the query duration.
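
Since each sample stands for one sampling period, you can also estimate the total CPU time of the query directly from trace_log. A sketch, assuming the 10 millisecond periods configured earlier:

-- Each CPU sample represents roughly 10 ms of work
-- (query_profiler_cpu_time_period_ns = 10000000).
SELECT count() * 10 AS approx_cpu_ms
FROM system.trace_log
WHERE (query_id = 'f435eee7-1518-4215-b872-19350cb8e6f7')
    AND (event_date = today())
    AND (trace_type = 'CPU')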

Flamegraph

Reading such a trace_log is not easy. We found that a flamegraph was a much better visualisation. We used clickhouse-flamegraph for that job.

To install it, run:

Terminal window
curl -sL https://raw.githubusercontent.com/Slach/clickhouse-flamegraph/master/install.sh | sudo bash

By default clickhouse-flamegraph reads all the data in the trace_log table and prints a flamegraph for each query and each type of metric (CPU, Real, Memory, MemorySample).

The tool accepts filters, allowing us to narrow our window to a specific time frame, keeping only CPU traces for our specific query. For example, here we’re printing a flamegraph of CPU usage for the query done in the trace_log part:

Terminal window
clickhouse-flamegraph --trace-types 'CPU' --query-ids 'f435eee7-1518-4215-b872-19350cb8e6f7' --date-from "2021-09-13 15:51:08 +0200" --dsn='http://localhost:8123'

And here is the result:

[CPU flamegraph of a query]

This provided a much more visual way to see that ClickHouse was spending most of its time in DB::Set::insertFromBlockImpl and DB::Set::executeImpl.

The flamegraph is interactive: click on a specific block to zoom into the execution stack.

To sum up

We’ve seen how to identify a slow query and understand why it is slow. These tools are specific to ClickHouse, but you should also consider standard sysadmin tools like sar to monitor CPU, memory, and disk usage. Even if they don’t give you a solution to speed up the query, they help you spot the root cause, which is a good start. If you want to continue on this topic, you can look at this presentation made by Altinity.