Use of LowCardinality and/or INDEX to speed up query performance #4796
I will try to comment on your results. First of all, let me briefly describe how it works.
When a query is executed, the execution time is usually the sum of the time to read data from disk and the time to process it. For examples 4a, 4b, 4c […]. For example 2, […]. For examples 3a, 3b the main time is spent by […]. Grouping queries are faster because of a special optimization for single […]. As for […]
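One way to see the read-volume difference described above is to compare the on-disk size of each column in the two tables. A sketch, assuming the Dict/LCDict tables and the DemoDB database used elsewhere in this thread:

SELECT
    table,
    name,
    type,
    formatReadableSize(data_compressed_bytes) AS compressed
FROM system.columns
WHERE database = 'DemoDB' AND table IN ('Dict', 'LCDict')
ORDER BY table, name

The LowCardinality columns should show noticeably smaller compressed sizes, which is where most of the speedup for full-scan queries comes from.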
Also, your test for insertion is not completely fair, because for LCDict you read from another table, but for Dict you do not. Anyway, insertion into LCDict is indeed slower. What I have tested:
:) create table Dict2 as Dict
CREATE TABLE Dict2 AS Dict
Ok.
0 rows in set. Elapsed: 0.003 sec.
:) INSERT INTO LCDict SELECT d2, d1, uint, flt, str FROM Dict
INSERT INTO LCDict SELECT
d2,
d1,
uint,
flt,
str
FROM Dict
Ok.
0 rows in set. Elapsed: 3.196 sec. Processed 10.00 million rows, 369.00 MB (3.13 million rows/s., 115.47 MB/s.)
:) INSERT INTO Dict2 SELECT d2, d1, uint, flt, str FROM Dict
INSERT INTO Dict2 SELECT
d2,
d1,
uint,
flt,
str
FROM Dict
Ok.
0 rows in set. Elapsed: 2.026 sec. Processed 10.00 million rows, 369.00 MB (4.94 million rows/s., 182.15 MB/s.)
@KochetovNicolai your comments are indeed helpful; it seems this positionally encoded representation is not suitable for all cases. I think it could be varied according to the data column. Have you considered using bit-stuffed pointers as keys? Such indices have only as many bits as it takes to represent the field, so bitwise operators could speed up calculations in functions.
Regarding Data Skipping Indices, I need to understand the rule of thumb for relating the type and granularity_value parameters of a skip index to the table parts, the primary index granularity, and the distinct values (frequency?) of the numeric/string columns in the MergeTree tables I created above. I think a nice graphical illustration, such as the one for the sorting key and marks, would make this clear.
SELECT formatReadableSize(bytes) AS size, rows
FROM system.parts
WHERE active AND (database = 'DemoDB') AND (table = 'Dict')
┌─size─────┬───rows─┐
│ 1.58 MiB │ 100000 │
│ 1.58 MiB │ 100000 │
│ ...      │    ... │
│ 1.58 MiB │ 100000 │
└──────────┴────────┘
100 rows in set. Elapsed: 0.006 sec.
SELECT count(*) AS UInt64_DistinctValues FROM
( SELECT uint, count(uint) AS cnt FROM Dict
GROUP BY uint)
┌─UInt64_DistinctValues─┐
│ 7000 │
└───────────────────────┘
--- Average Frequency: 1428
┌─Float64_DistinctValues─┐
│ 7000 │
└────────────────────────┘
--- Average Frequency: 1428
┌─String_DistinctValues─┐
│ 10 │
└───────────────────────┘
--- Average Frequency: 100000
i. I guess for 3a, 3b, 4a, 4b (numeric queries) the type is […]
ii. Point query 4c will probably use the […]
Can you make a start to show how to write two […]?
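A sketch of how two such skip indices could be written, assuming the Dict table from this thread; the index names, the set() size, and the GRANULARITY values are illustrative assumptions, and on ClickHouse 19.x the experimental setting had to be enabled first:

SET allow_experimental_data_skipping_indices = 1

ALTER TABLE Dict ADD INDEX uint_minmax_idx uint TYPE minmax GRANULARITY 4

ALTER TABLE Dict ADD INDEX str_set_idx str TYPE set(100) GRANULARITY 4

-- note: an index added by ALTER applies only to parts written afterwards;
-- existing parts pick it up when they are rewritten by merges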
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I assume for this same reason […]
We also have some questions regarding LC.
The ClickHouse docs (https://clickhouse.com/docs/en/sql-reference/data-types/lowcardinality/) say: […]
The reference further down the page (https://altinity.com/blog/2019/3/27/low-cardinality) says: […]
As I understand it, each part holds one or more dictionaries of size low_cardinality_max_dictionary_size, which defaults to 8192. Our table consists of (String, Map<String, String>); ideally we would use LC(String) for all of them, but: […]
My questions: […]
Thanks
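For reference, the LowCardinality-related settings mentioned above can be inspected directly; a small sketch (setting names as shipped in recent ClickHouse releases):

SELECT name, value
FROM system.settings
WHERE name LIKE 'low_cardinality%'

Among others, this lists low_cardinality_max_dictionary_size (the 8192 default discussed above) and low_cardinality_use_single_dictionary_for_part, which controls whether a part keeps a single shared dictionary or several.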
UPDATED 28th March 2019, 15:22 Europe/Athens (EET, +0200)
Hi, earlier this year we had a discussion about the status of the LowCardinality feature (#4074).
Now I have upgraded to ClickHouse 19.4.0.49 and checked the documentation: the feature is still not documented. Moreover, recent bugfixes in the Changelog mention a fix to the set index feature for Nullable and LowCardinality columns. I want to use these features to replace the data types of fields that have low cardinality, but I am a bit confused about how to use them and which cases show a clear advantage in terms of performance and storage efficiency. That is why I wrote the following artificial benchmark to share my results with you and discuss any alternatives.
Storage engines - Insert performance and Footprint
A. Create and load table using plain data types (UInt64, Float64, String)
B. Create and load table using LowCardinality data types
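For concreteness, a minimal sketch of what A and B could look like. The column names come from the INSERT statements quoted earlier in this thread; the types, column order, and ORDER BY key are assumptions:

-- A. Plain data types
CREATE TABLE Dict
(
    d2   UInt64,
    d1   UInt64,
    uint UInt64,
    flt  Float64,
    str  String
)
ENGINE = MergeTree
ORDER BY (d2, d1)

-- B. Same layout with LowCardinality wrappers
CREATE TABLE LCDict
(
    d2   UInt64,
    d1   UInt64,
    uint LowCardinality(UInt64),
    flt  LowCardinality(Float64),
    str  LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (d2, d1)

-- note: current ClickHouse versions require
-- SET allow_suspicious_low_cardinality_types = 1
-- before wrapping numeric types in LowCardinality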
This benchmark shows that the table with LowCardinality data types has a significantly smaller footprint (94 MB vs 157 MB) at the cost of significantly slower insertion (4.16 sec vs 2.59 sec).
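The footprint comparison can be reproduced with a query like the following (a sketch, reusing the system.parts pattern from earlier in the thread):

SELECT
    table,
    formatReadableSize(sum(bytes)) AS size,
    sum(rows) AS rows
FROM system.parts
WHERE active AND database = 'DemoDB' AND table IN ('Dict', 'LCDict')
GROUP BY table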
1. Grouping queries
1a. With UInt64 numbers
1b. With Float64 numbers
1c. With String
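The grouping queries could look like the following sketch for 1c (1a and 1b are analogous, grouping by uint and flt); each query is run once against Dict and once against LCDict:

SELECT str, count() AS cnt
FROM Dict
GROUP BY str
ORDER BY cnt DESC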
2. String pattern queries
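A sketch of what the string pattern query could look like (the pattern itself is an illustrative placeholder):

SELECT count()
FROM Dict
WHERE str LIKE '%abc%'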
3. Number range queries
3a. With UInt64 numbers
3b. With Float64 numbers
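A sketch of the number range queries (the bounds are illustrative placeholders):

SELECT count()
FROM Dict
WHERE uint BETWEEN 1000 AND 2000

SELECT count()
FROM Dict
WHERE flt >= 1000.0 AND flt < 2000.0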
4. Point queries
4a. With UInt64 numbers
4b. With Float64 numbers
4c. With String
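A sketch of the point queries (the literals are illustrative placeholders):

SELECT count()
FROM Dict
WHERE uint = 1234

SELECT count()
FROM Dict
WHERE str = 'value-1'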
Summary
There is a noticeable difference in the volume of data processed in all queries when you use LowCardinality data types.
Examples 1a, 1b, 1c (grouping queries) and 2 (string pattern queries) clearly show an improvement in performance when the LowCardinality feature is used.
On the contrary, examples 4a, 4b, 4c (point queries) show insignificant differences in execution time, and examples 3a, 3b (number range queries) show that searching with LowCardinality data types is slower than with their plain counterparts, even though the volume processed is significantly reduced by the LowCardinality feature.
Question:
Is there an alternative solution to improve the speed of the point and number range queries above using LowCardinality and/or INDEX? If yes, how would you write the engine and queries above?
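One shape such a solution could take, as a sketch rather than a recommendation: the index names, GRANULARITY values, and set() size are illustrative assumptions, and on 19.x the allow_experimental_data_skipping_indices setting had to be enabled first:

CREATE TABLE LCDictIdx
(
    d2   UInt64,
    d1   UInt64,
    uint LowCardinality(UInt64),
    flt  LowCardinality(Float64),
    str  LowCardinality(String),
    INDEX uint_minmax uint TYPE minmax GRANULARITY 4,
    INDEX str_set str TYPE set(100) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (d2, d1)

-- range query: granules whose [min, max] interval excludes the range are skipped
SELECT count() FROM LCDictIdx WHERE uint BETWEEN 1000 AND 2000

-- point query: granules whose stored value set lacks the literal are skipped
SELECT count() FROM LCDictIdx WHERE str = 'value-1'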
Specs of the machine running the benchmark
Thanks in advance for your valuable assistance.