
Use of LowCardinality and/or INDEX to speed up query performance #4796

Closed
healiseu opened this issue Mar 25, 2019 · 6 comments
Labels: question

@healiseu

healiseu commented Mar 25, 2019

UPDATED 28th March 2019, 15:22 Europe/Athens (EET, +0200)


Hi, earlier this year we had a discussion about the status of the LowCardinality feature (#4074).

Now I have upgraded to ClickHouse 19.4.0.49 and checked the documentation: the feature is still not documented. Moreover, recent bugfixes in the Changelog mention a fix for the set index feature on Nullable and LowCardinality columns.

I want to use these features to replace the data types of fields that have low cardinality, but I am a bit confused about how to use them and in which cases they show a clear advantage in terms of performance and storage efficiency. That is why I wrote the following artificial benchmark, to share my results with you and discuss any alternatives.


Storage engines - Insert performance and Footprint

A. Create and load table using plain data types (UInt64, Float64, String)

DROP TABLE IF EXISTS Dict
CREATE TABLE Dict
(
    d2 UInt32,
    d1 UInt32, 
    uint UInt64,
    flt Float64,
    str String 
)
ENGINE = MergeTree()
PARTITION BY d2
ORDER BY (d2, d1)

INSERT INTO Dict SELECT 
    intDiv(number, 100000) AS d2,
    number AS d1,
    (rand64() % 7000 +1)*10000 AS uint,
    uint * pi() AS flt,
    ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'][rand()%10+1] AS str
FROM numbers(10000000)
--- Elapsed: 2.592 sec. Processed 10.03 million rows, 80.22 MB

OPTIMIZE TABLE Dict FINAL
--- Elapsed: 1.496 sec

B. Create and load table using LowCardinality data types

CREATE TABLE LCDict
(
    d2 UInt32,
    d1 UInt32, 
    uintlc LowCardinality(UInt64),
    fltlc LowCardinality(Float64),
    strlc LowCardinality(String)
)
ENGINE = MergeTree()
PARTITION BY d2
ORDER BY (d2, d1)

INSERT INTO LCDict SELECT d2, d1, uint, flt, str FROM Dict
--- Elapsed: 4.159 sec. Processed 10.00 million rows, 369.00 MB

OPTIMIZE TABLE LCDict FINAL
--- Elapsed: 1.782 sec

SELECT table, sum(bytes) / 1048576 AS size, sum(rows) AS rows
FROM system.parts 
WHERE active AND (database = 'DemoDB') AND (table LIKE '%Dict')
GROUP BY table ORDER BY size DESC

┌─table──┬───────────────size─┬─────rows─┐
│ Dict   │ 157.64117813110352 │ 10000000 │
│ LCDict │  93.66338920593262 │ 10000000 │
└────────┴────────────────────┴──────────┘

This benchmark shows that the table with LowCardinality data types has a significantly smaller footprint (94 MB vs 157 MB), at the cost of significantly slower insertion (4.16 sec vs 2.59 sec).

1. Grouping queries

1a. With UInt64 numbers

SELECT uint, count(uint) AS cnt FROM Dict 
GROUP BY uint

(0.183+0.189+0.189)/3
7000 rows in set. Elapsed: 0.187 sec. Processed 10.00 million rows, 80.00 MB

SELECT uintlc, count(uintlc) AS cnt FROM LCDict 
GROUP BY uintlc

(0.152+0.158+0.158)/3
7000 rows in set. Elapsed: 0.156 sec. Processed 10.00 million rows, 31.20 MB

1b. With Float64 numbers

SELECT flt, count(flt) AS cnt FROM Dict 
GROUP BY flt

(0.214+0.225+0.264)/3
7000 rows in set. Elapsed: 0.234 sec. Processed 10.00 million rows, 80.00 MB

SELECT fltlc, count(fltlc) AS cnt FROM LCDict 
GROUP BY fltlc

(0.191+0.191+0.191)/3
7000 rows in set. Elapsed: 0.191 sec. Processed 10.00 million rows, 31.20 MB

1c. With String

SELECT str, count(str) AS cnt  FROM Dict  
GROUP BY str;

(0.260+0.223+0.223)/3
10 rows in set. Elapsed: 0.235 sec. Processed 10.00 million rows, 129.00 MB

SELECT strlc, count(strlc) AS cnt  FROM LCDict  
GROUP BY strlc;

(0.089+0.101+0.090)/3
10 rows in set. Elapsed: 0.093 sec. Processed 10.00 million rows, 10.03 MB

2. String pattern queries

SELECT d2, d1, str FROM Dict 
WHERE str LIKE '%en';

(0.729+0.727+0.748)/3
2000218 rows in set. Elapsed: 0.734 sec. Processed 10.00 million rows, 145.00 MB

SELECT d2, d1, strlc FROM LCDict 
WHERE strlc LIKE '%en';

(0.159+0.163+0.166)/3
2000218 rows in set. Elapsed: 0.162 sec. Processed 10.00 million rows, 26.03 MB

3. Number range queries

3a. With UInt64 numbers

SELECT d2, d1, uint FROM Dict 
WHERE uint BETWEEN 2000000 and 7000000 ORDER by uint;

(0.375+0.378+0.394)/3
715826 rows in set. Elapsed: 0.382 sec. Processed 10.00 million rows, 85.73 MB


SELECT d2, d1, uintlc FROM LCDict 
WHERE uintlc BETWEEN 2000000 and 7000000 ORDER by uintlc;

(0.678+0.628+0.663)/3
715826 rows in set. Elapsed: 0.656 sec. Processed 10.00 million rows, 36.93 MB

3b. With Float64 numbers

SELECT d2, d1, flt FROM Dict 
WHERE flt BETWEEN 2000000 and 7000000 ORDER by flt;

(0.341+0.378+0.380)/3
227011 rows in set. Elapsed: 0.366 sec. Processed 10.00 million rows, 81.82 MB


SELECT d2, d1, fltlc FROM LCDict 
WHERE fltlc BETWEEN 2000000 and 7000000 ORDER by fltlc;

(0.450+0.438+0.475)/3
227011 rows in set. Elapsed: 0.454 sec. Processed 10.00 million rows, 33.02 MB 

4. Point queries

4a. With UInt64 numbers

SELECT d2, d1, uint FROM Dict 
WHERE uint = 10000 ORDER BY d2, d1

(0.090+0.124+0.129)/3
1467 rows in set. Elapsed: 0.114 sec. Processed 10.00 million rows, 80.01 MB


SELECT d2, d1, uintlc FROM LCDict 
WHERE uintlc = 10000 ORDER BY d2, d1

(0.099+0.098+0.095)/3
1467 rows in set. Elapsed: 0.095 sec. Processed 10.00 million rows, 31.21 MB

4b. With Float64 numbers

SELECT d2, d1, flt FROM Dict
WHERE flt=31415.926535897932 ORDER BY d2, d1

(0.109+0.107+0.093)/3
1467 rows in set. Elapsed: 0.103 sec. Processed 10.00 million rows, 80.01 MB


SELECT d2, d1, fltlc FROM LCDict
WHERE fltlc=31415.926535897932 ORDER BY d2, d1

(0.126+0.094+0.097)/3
1467 rows in set. Elapsed: 0.105 sec. Processed 10.00 million rows, 31.21 MB

4c. With String

SELECT d2, d1, str FROM Dict 
WHERE str = 'ten' ORDER BY d2, d1

(0.446+0.463+0.424)/3
999712 rows in set. Elapsed: 0.444 sec. Processed 10.00 million rows, 137.00 MB


SELECT d2, d1, strlc FROM LCDict 
WHERE strlc = 'ten' ORDER BY d2, d1

(0.487+0.443+0.447)/3
999712 rows in set. Elapsed: 0.459 sec. Processed 10.00 million rows, 18.03 MB

Summary

  1. All queries process a noticeably smaller volume of data when LowCardinality data types are used.

  2. Examples 1a, 1b, 1c (grouping queries) and 2 (string pattern queries) clearly show a performance improvement with the LowCardinality feature.

  3. On the contrary, examples 4a, 4b, 4c (point queries) show insignificant differences in execution time, and 3a, 3b (number range queries) show that searching on LowCardinality columns is slower than on their plain data type counterparts, even though the processed volume is significantly reduced by LowCardinality.

Question:

Is there an alternative solution to improve the speed of the point and number range queries above using LowCardinality and/or INDEX? If yes, how would you write the engine and the queries above?

Specs of the machine running the benchmark

  • Linux Ubuntu x64 16.04
  • Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz
  • RAM 16GB
  • SSD 2.5inches SATA 3.2, 3.0 Gb/s Transcend 120GB 220S TS120GSSD220S

Thanks in advance for your valuable assistance

External References

  1. String Optimization in ClickHouse - Alexander Zaitsev
  2. Magical Mystery Tour of the LowCardinality Data Type - Nikolai Kochetov
@KochetovNicolai
Member

I will try to comment on your results.

First of all, let me briefly describe the LowCardinality structure. ClickHouse processes data by blocks. A block is a set of columns, but a block can't store all the data from a column. Instead, it stores a part of each column (for example, the first 8192 values). Such a column part can be stored as a plain array (usual data type) or as a (dictionary, keys) pair (LowCardinality). Here the dictionary is generally a dictionary for the block (but several blocks can share a single dictionary). So the original part of the column is dictionary[keys].

The LowCardinality type is mainly made for the String case. The idea of LowCardinality is to avoid manual string enumeration (which would have better performance than LowCardinality, but is not always convenient or possible). There is usually no reason to use LowCardinality for numeric types. The actual reason it was added is to support the rule f(dictionary[keys]) = f(dictionary)[keys]. So, when any function with a LowCardinality argument is executed, it is executed over the dictionary, and the result is another LowCardinality column.
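
For reference, the manual enumeration mentioned above could look roughly like this for the str column from the benchmark (EnumDict is only an illustrative name):

CREATE TABLE EnumDict
(
    d2 UInt32,
    d1 UInt32,
    str Enum8('one' = 1, 'two' = 2, 'three' = 3, 'four' = 4, 'five' = 5,
              'six' = 6, 'seven' = 7, 'eight' = 8, 'nine' = 9, 'ten' = 10)
)
ENGINE = MergeTree()
PARTITION BY d2
ORDER BY (d2, d1)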

LowCardinality(numeric) is not very efficient because the keys array may use the same amount of memory as the original array itself. This is always true for UInt8. So, for LowCardinality(numeric) we usually process more data than for the usual column.
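
One way to check this on the benchmark tables is to compare per-column on-disk sizes (this sketch assumes the DemoDB database used above):

SELECT
    table,
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE (database = 'DemoDB') AND (table IN ('Dict', 'LCDict'))
ORDER BY table, name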

When a query is executed, the execution time is usually the sum of the time to read data from disk and the time to process it: exec_time = read_time + process_time. For some queries read_time can be much larger than process_time. This is usually true for queries which only read and filter data (select * where).

For examples 4a, 4b, 4c read_time is much larger than process_time, so in most cases the differences are not significant.

For example 2, the like function takes significant time, so the rule f(dictionary[keys]) = f(dictionary)[keys] improves performance.

For examples 3a, 3b most of the time is spent in ORDER BY. It seems that sorting is faster for ordinary numeric columns because for LowCardinality we use double indexing. Probably, it would be better to convert the column to an ordinary one before ORDER BY (need to check it). However, the conversion also takes time, so such queries with LowCardinality will likely still be slower.
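
A sketch of that manual conversion for query 3a (whether it actually helps would have to be measured):

SELECT d2, d1, CAST(uintlc AS UInt64) AS uint_plain
FROM LCDict
WHERE uintlc BETWEEN 2000000 AND 7000000
ORDER BY uint_plain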

Grouping queries are faster because of a special optimization for a single LowCardinality column. With several grouping keys it will likely not be much better.

As for the set index, it works independently of LowCardinality. I think it will make sense to use both LowCardinality and a set index (however, I haven't tested the performance yet). I would recommend using a set index for numeric columns and LowCardinality for low-cardinality string columns, and also testing LowCardinality together with a set index on your dataset to check whether it improves anything.
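
For example, such a combination could be sketched like this (index names and the set()/GRANULARITY values are illustrative, not tuned recommendations; on 19.x data skipping indices may first require SET allow_experimental_data_skipping_indices = 1):

ALTER TABLE LCDict ADD INDEX uintlc_set uintlc TYPE set(10000) GRANULARITY 4
ALTER TABLE LCDict ADD INDEX strlc_set strlc TYPE set(16) GRANULARITY 4

Note that a newly added skip index only covers newly written parts; existing data would need ALTER TABLE ... MATERIALIZE INDEX (if available in your version) or a rewrite of the parts.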

@KochetovNicolai
Member

Also, your insertion test is not completely fair, because for LCDict you read from another table, but for Dict you don't. Anyway, insertion into LCDict is indeed slower. What I have tested:

:) create table Dict2 as Dict

CREATE TABLE Dict2 AS Dict

Ok.

0 rows in set. Elapsed: 0.003 sec. 

 :) INSERT INTO LCDict SELECT d2, d1, uint, flt, str FROM Dict

INSERT INTO LCDict SELECT 
    d2, 
    d1, 
    uint, 
    flt, 
    str
FROM Dict 

Ok.

0 rows in set. Elapsed: 3.196 sec. Processed 10.00 million rows, 369.00 MB (3.13 million rows/s., 115.47 MB/s.) 

 :) INSERT INTO Dict2 SELECT d2, d1, uint, flt, str FROM Dict

INSERT INTO Dict2 SELECT 
    d2, 
    d1, 
    uint, 
    flt, 
    str
FROM Dict 

Ok.

0 rows in set. Elapsed: 2.026 sec. Processed 10.00 million rows, 369.00 MB (4.94 million rows/s., 182.15 MB/s.) 

@healiseu
Author

healiseu commented Mar 28, 2019

@KochetovNicolai your comments are indeed helpful; it seems this positional encoded representation is not suitable for all cases. I think it could be varied according to the data column. Have you considered using bit-stuffed pointers as keys? Such indices only need as many bits as it takes to represent the field, and then bitwise operators could speed up calculations in functions.

Regarding Data Skipping Indices, I need to understand the rule of thumb for relating the type and granularity_value parameters of a skip index to the table parts, the primary index granularity, and the distinct values (frequency?) of the numeric/string column data types in the MergeTree storage engines I created above.

I think a nice graphical illustration, such as the one for the sorting key and marks, would make this clear.

[image: clickhouse_index]

SELECT formatReadableSize(bytes) AS size, rows FROM system.parts 
WHERE active AND (database = 'DemoDB') AND (table = 'Dict')

┌─size─────┬───rows─┐
│ 1.58 MiB │ 100000 │
│ 1.58 MiB │ 100000 │
│ ...      │    ... │
│ 1.58 MiB │ 100000 │
└──────────┴────────┘
100 rows in set. Elapsed: 0.006 sec.

SELECT count(*) AS UInt64_DistinctValues FROM 
( SELECT uint, count(uint) AS cnt FROM Dict 
  GROUP BY uint) 
┌─UInt64_DistinctValues─┐
│                  7000 │
└───────────────────────┘
--- Average Frequency: 1428

┌─Float64_DistinctValues─┐
│                   7000 │
└────────────────────────┘
--- Average Frequency: 1428

┌─String_DistinctValues─┐
│                    10 │
└───────────────────────┘
--- Average Frequency: 100000

i. I guess that for 3a, 3b, 4a, 4b (numeric queries) the index type is minmax, but what should the GRANULARITY value be in relation to the SETTINGS index_granularity?

ii. Point query 4c will probably use the set type for the skip index, but again, what values should I use for the number of members in the set and for the granularity here?

Can you make a start by showing how to write the two MergeTree engines (i, ii) with sensible parameters for the skip index according to the specifications above? A sketch of what I have in mind follows below.
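
To make the question concrete, this is the kind of definition I mean (table names and GRANULARITY values are placeholders I picked, not recommendations; one skip-index block covers GRANULARITY x index_granularity rows, i.e. 4 x 8192 = 32768 rows with the default settings):

-- (i) minmax skip index for the numeric range/point queries
CREATE TABLE DictMinMax
(
    d2 UInt32,
    d1 UInt32,
    uint UInt64,
    flt Float64,
    str String,
    INDEX uint_mm uint TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY d2
ORDER BY (d2, d1)

-- (ii) set skip index for the string point queries (str has only 10 distinct values)
CREATE TABLE DictSet
(
    d2 UInt32,
    d1 UInt32,
    uint UInt64,
    flt Float64,
    str String,
    INDEX str_set str TYPE set(10) GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY d2
ORDER BY (d2, d1)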

@stale

stale bot commented Oct 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@jakesylvestre

I assume for this same reason FixedString doesn't improve LowCardinality performance?

@amirvaza
Contributor

amirvaza commented Dec 29, 2021

Hi @KochetovNicolai

We also have some questions regarding LC (LowCardinality).

The ClickHouse docs say (https://clickhouse.com/docs/en/sql-reference/data-types/lowcardinality/):
"The efficiency of using LowCardinality data type depends on data diversity.
If a dictionary contains less than 10,000 distinct values, then ClickHouse mostly shows higher efficiency of data reading and storing.
If a dictionary contains more than 100,000 distinct values, then ClickHouse can perform worse in comparison with using ordinary data types."

The reference at the bottom of that page (https://altinity.com/blog/2019/3/27/low-cardinality) says:
"There is no hard limit in ClickHouse, the results are usually good if number of distinct values is below 10M.
For big tables with many partitions and parts, the total number of distinct values can even be higher, if 10M limit is preserved at part level."

As I understand it, each part holds one or more dictionaries of size <low_cardinality_max_dictionary_size>, which defaults to 8192.
During the part merge process, those dictionaries are merged as well.
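
(The dictionary-size settings I am referring to can be listed with something like:)

SELECT name, value
FROM system.settings
WHERE name LIKE 'low_cardinality%'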

Our table consists of (String, Map<String, String>); ideally we would use LC(String) for all of them, but:
In the worst-case scenario (50 k/v pairs per Map), with 3 LC(String) columns, there will be 8192x50x50 = 20M values per granule.
In a more reasonable scenario, we will have 100K-500K distinct values (mostly because of the Values in the map) per granule.
Part-wise, we will have a lot more, because each part consists of many granules.

My questions:

  1. First of all, why do the official docs state that "100k distinct values can perform worse", while the research the docs link to states 10M, which is 100 times more? Which one is the real number?

  2. Is it possible to avoid part merges when we have a lot of distinct values? Is there any other way to avoid such cases?

  3. If all values are distinct, does LC have no real use, since reading the dictionary from disk is the same as (or even worse than) reading the column itself?

  4. How is a read (query) on an LC column implemented? AFAIK CH reads one part at a time, block by block; is this why we should avoid large dictionaries, i.e. both an IO and a RAM explosion for the entire part?
    Does the dictionary get read for every query, or only for GROUP BY queries? What if I just use uniq() or a specific value in WHERE? In such cases the "real" values aren't important for the result.

Thanks
