Skip to content

Add new DataType Map(key,value) #15806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Dec 16, 2020
Merged

Add new DataType Map(key,value) #15806

merged 22 commits into from
Dec 16, 2020

Conversation

hexiaoting
Copy link
Contributor

@hexiaoting hexiaoting commented Oct 10, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add new datatype Map for supporting storage k:v .
related to issue: #1841
1st version for Map only support String type of key and value.
Later I will implement Int and other type for key and value.

Detailed description / Documentation draft:
Add new datatype Map for supporting storage k:v .
related to issue: #1841
1st version for Map only support String type of key and value.
Later I will implement Int and other type for key and value.
Usage:

create table table_map (a Map(String, String)) engine Memory;
insert into table_map values ({'name':'zhangsan', 'gender':'male'}), ({'name':'lisi', 'gender':'female'});
select a['name'] from table_map;

Addition:

PostgreSQL Json data type

postgresql does not support map data type but json, Operator for json in PostgreSQL:

Operator	Right Operand Type	Description	Example
->	int	Get JSON array element	'[1,2,3]'::json->2
->	text	Get JSON object field	'{"a":1,"b":2}'::json->'b'
->>	int	Get JSON array element as text	'[1,2,3]'::json->>2
->>	text	Get JSON object field as text	'{"a":1,"b":2}'::json->>'b'
#>	array of text	Get JSON object at specified path	'{"a":[1,2,3],"b":[4,5,6]}'::json#>'{a,2}'
#>>	array of text	Get JSON object at specified path as text	'{"a":[1,2,3],"b":[4,5,6]}'::json#>>'{a,2}'

And postgreSQL support other json functions like : xxx_to_json, json_each....

Hive Map data type

Usage:

create table map_test ( metrics MAP <STRING, BIGINT>);

desc map_test.metrics;
+-----------+------------------+--------------------+--+
| col_name  |    data_type     |      comment       |
+-----------+------------------+--------------------+--+
| metrics   | map<string,int>  | from deserializer  |
+-----------+------------------+--------------------+--+

And Hive does not support literals for complex types (array, map, struct, union)
select * from map_test;
{"abc": 123}

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Oct 10, 2020
@CurtizJ CurtizJ self-assigned this Oct 12, 2020
@hexiaoting
Copy link
Contributor Author

hexiaoting commented Oct 13, 2020

@CurtizJ
Hi, I build these codes in my environment successfully with gcc10 or clang11.
How can I reproduce the build error in "ClickHouse build check" in my environment?

@CurtizJ
Copy link
Member

CurtizJ commented Oct 13, 2020

You can try to merge with fresh master.

/build/src/Functions/FunctionsConversion.h:2201:19: error: no member named 'getByPosition' in 'std::__1::vector<DB::ColumnWithTypeAndName, std::__1::allocator<DB::ColumnWithTypeAndName>>'
Seems that error appears because recently Block from Block.h was replaced to ColumnsWithTypeAndName in functions (here)

@ildus
Copy link
Contributor

ildus commented Oct 13, 2020

Have you considered using separate columns for each key? In that case it would be easy to implement combinators and such stuff, since then it would be no need to collect values by keys first.

@hexiaoting
Copy link
Contributor Author

Have you considered using separate columns for each key? In that case it would be easy to implement combinators and such stuff, since then it would be no need to collect values by keys first.

In the first implement version for Map Datatype, I use tow array columns, one for key and the other for value. I think it's the simplest way to implement this.
If the operator of combinators in Map type is needed by many users, I will try to implement it in later version.

@hexiaoting
Copy link
Contributor Author

@CurtizJ I have finished coding. Can you review it when free? Thank you.

@CurtizJ
Copy link
Member

CurtizJ commented Oct 21, 2020

Yes, I will try to review your PR by the end of the week.

BTW, please, add tests with other table engines, at least with MergeTree, because storage Memory doesn't use serialization/deserialization from IDataType and it's not checked. I've created a MergeTree table with type Map and got segfault on insert.

CREATE TABLE table_map(a Map(String, String)) ENGINE = MergeTree ORDER BY tuple();

insert into table_map values ({'name':'zhangsan', 'gender':'male'}), ({'name':'lisi', 'gender':'female'});
stacktrace

2020.10.22 01:01:29.264497 [ 41075 ] {} <Fatal> BaseDaemon: ########################################
2020.10.22 01:01:29.264648 [ 41075 ] {} <Fatal> BaseDaemon: (version 20.11.1.4942, build id: DB5B18ED166F29B8) (from thread 40789) (query_id: ec9d1905-25c1-48b6-bea9-1b82849f3c01) Received signal Segmentation fault (11)
2020.10.22 01:01:29.264685 [ 41075 ] {} <Fatal> BaseDaemon: Address: 0x98 Access: read. Address not mapped to object.
2020.10.22 01:01:29.264724 [ 41075 ] {} <Fatal> BaseDaemon: Stack trace: 0x7bbe247 0xd7ac1ce 0xd82bc01 0xe3f0807 0xe3f02fe 0xe3f09e5 0xe4b2e5c 0xe42b50c 0xe34ba8c 0xdbf0184 0xdc31043 0xdc2fdcc 0xdc2fe49 0xe5e9c40 0xe5e52a9 0xe5f1db7 0x10dc8def 0x10dca7fe 0x10efbab9 0x10ef79ea 0x7fb0857da6db 0x7fb0850f788f
2020.10.22 01:01:29.264828 [ 41075 ] {} <Fatal> BaseDaemon: 2. DB::WriteBuffer::write(char const*, unsigned long) @ 0x7bbe247 in /usr/bin/clickhouse
2020.10.22 01:01:29.264904 [ 41075 ] {} <Fatal> BaseDaemon: 3. DB::DataTypeArray::serializeBinaryBulkWithMultipleStreams(DB::IColumn const&, unsigned long, unsigned long, DB::IDataType::SerializeBinaryBulkSettings&, std::__1::shared_ptr<DB::IDataType::SerializeBinaryBulkState>&) const @ 0xd7ac1ce in /usr/bin/clickhouse
2020.10.22 01:01:29.264937 [ 41075 ] {} <Fatal> BaseDaemon: 4. DB::DataTypeMap::serializeBinaryBulkWithMultipleStreams(DB::IColumn const&, unsigned long, unsigned long, DB::IDataType::SerializeBinaryBulkSettings&, std::__1::shared_ptr<DB::IDataType::SerializeBinaryBulkState>&) const @ 0xd82bc01 in /usr/bin/clickhouse
2020.10.22 01:01:29.265011 [ 41075 ] {} <Fatal> BaseDaemon: 5. DB::MergeTreeDataPartWriterCompact::writeColumnSingleGranule(DB::ColumnWithTypeAndName const&, std::__1::function<DB::WriteBuffer* (std::__1::vector<DB::IDataType::Substream, std::__1::allocator<DB::IDataType::Substream> > const&)>, unsigned long, unsigned long) @ 0xe3f0807 in /usr/bin/clickhouse
2020.10.22 01:01:29.265059 [ 41075 ] {} <Fatal> BaseDaemon: 6. DB::MergeTreeDataPartWriterCompact::writeBlock(DB::Block const&) @ 0xe3f02fe in /usr/bin/clickhouse
2020.10.22 01:01:29.265086 [ 41075 ] {} <Fatal> BaseDaemon: 7. DB::MergeTreeDataPartWriterCompact::finishDataSerialization(DB::MergeTreeDataPartChecksums&, bool) @ 0xe3f09e5 in /usr/bin/clickhouse
2020.10.22 01:01:29.265122 [ 41075 ] {} <Fatal> BaseDaemon: 8. DB::MergedBlockOutputStream::writeSuffixAndFinalizePart(std::__1::shared_ptr<DB::IMergeTreeDataPart>&, bool, DB::NamesAndTypesList const*, DB::MergeTreeDataPartChecksums*) @ 0xe4b2e5c in /usr/bin/clickhouse
2020.10.22 01:01:29.265155 [ 41075 ] {} <Fatal> BaseDaemon: 9. DB::MergeTreeDataWriter::writeTempPart(DB::BlockWithPartition&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&) @ 0xe42b50c in /usr/bin/clickhouse
2020.10.22 01:01:29.265181 [ 41075 ] {} <Fatal> BaseDaemon: 10. DB::MergeTreeBlockOutputStream::write(DB::Block const&) @ 0xe34ba8c in /usr/bin/clickhouse
2020.10.22 01:01:29.265208 [ 41075 ] {} <Fatal> BaseDaemon: 11. DB::PushingToViewsBlockOutputStream::write(DB::Block const&) @ 0xdbf0184 in /usr/bin/clickhouse
2020.10.22 01:01:29.265237 [ 41075 ] {} <Fatal> BaseDaemon: 12. DB::AddingDefaultBlockOutputStream::write(DB::Block const&) @ 0xdc31043 in /usr/bin/clickhouse
2020.10.22 01:01:29.265266 [ 41075 ] {} <Fatal> BaseDaemon: 13. DB::SquashingBlockOutputStream::finalize() @ 0xdc2fdcc in /usr/bin/clickhouse
2020.10.22 01:01:29.265282 [ 41075 ] {} <Fatal> BaseDaemon: 14. DB::SquashingBlockOutputStream::writeSuffix() @ 0xdc2fe49 in /usr/bin/clickhouse
2020.10.22 01:01:29.265314 [ 41075 ] {} <Fatal> BaseDaemon: 15. DB::TCPHandler::processInsertQuery(DB::Settings const&) @ 0xe5e9c40 in /usr/bin/clickhouse
2020.10.22 01:01:29.265339 [ 41075 ] {} <Fatal> BaseDaemon: 16. DB::TCPHandler::runImpl() @ 0xe5e52a9 in /usr/bin/clickhouse
2020.10.22 01:01:29.265372 [ 41075 ] {} <Fatal> BaseDaemon: 17. DB::TCPHandler::run() @ 0xe5f1db7 in /usr/bin/clickhouse
2020.10.22 01:01:29.265400 [ 41075 ] {} <Fatal> BaseDaemon: 18. Poco::Net::TCPServerConnection::start() @ 0x10dc8def in /usr/bin/clickhouse
2020.10.22 01:01:29.265417 [ 41075 ] {} <Fatal> BaseDaemon: 19. Poco::Net::TCPServerDispatcher::run() @ 0x10dca7fe in /usr/bin/clickhouse
2020.10.22 01:01:29.265442 [ 41075 ] {} <Fatal> BaseDaemon: 20. Poco::PooledThread::run() @ 0x10efbab9 in /usr/bin/clickhouse
2020.10.22 01:01:29.265471 [ 41075 ] {} <Fatal> BaseDaemon: 21. Poco::ThreadImpl::runnableEntry(void*) @ 0x10ef79ea in /usr/bin/clickhouse
2020.10.22 01:01:29.265502 [ 41075 ] {} <Fatal> BaseDaemon: 22. start_thread @ 0x76db in /lib/x86_64-linux-gnu/libpthread-2.27.so
2020.10.22 01:01:29.265613 [ 41075 ] {} <Fatal> BaseDaemon: 23. /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:97: __clone @ 0x12188f in /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.27.so

@hexiaoting
Copy link
Contributor Author

@CurtizJ Ok, I will add tests with MergeTree Engine

Copy link
Member

@CurtizJ CurtizJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to implement function map (similar to array and tuple) to allow creation of maps not only for literals.

To allow something like this:

insert into table_map(m) select {'key1' : number, 'key2' : number * 2} from numbers(1000)
insert into table_map(m) select map('key1', number, 'key2', number * 2) from numbers(1000)

}
else
{
// Default value for unmatched keys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use default value of type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how to distinguish the real value and the default value?
for example, Integer type's default value is 0

const DataTypePtr & key_type = (typeid_cast<const DataTypeArray *>(kv_types[0].get()))->getNestedType();
const DataTypePtr & value_type = (typeid_cast<const DataTypeArray *>(kv_types[1].get()))->getNestedType();

Field index = (*columns[arguments[1]].column)[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems, that only const indices are supported. Need to add check for constness of argument. But it's better to implement support of non-const columns as index, like it's done in arrayElement function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will improve this.

}

template <typename DataType>
bool FunctionArrayElement::executeMappedValueNumber(const ColumnArray * column, std::vector<int> matched_idxs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a const reference for vector.

const ColumnArray::Offsets & offsets = column->getOffsets();
size_t rows = offsets.size();

for (size_t i = 0; i < rows; i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you know number of elements, it's better to do reserve for vector before loop for better performance.

std::string DataTypeMap::doGetName() const
{
WriteBufferFromOwnString s;
s << "Map(" << (typeid_cast<const DataTypeArray *>(keys.get()))->getNestedType()->getName()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to typeid_cast here. You have already saved key_type and value_type.

@Enmk Enmk mentioned this pull request Oct 26, 2020
ColumnMap::ColumnMap(MutableColumns && mutable_columns)
{
columns.reserve(mutable_columns.size());
for (auto & column : mutable_columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can init ColumnMap::columns to have less than 2 columns, that would cause a crash at run-time in say ColumnMap::cloneEmpty() or any other method that assumes that there are exactly 2 columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider checking that each of mutable_columns is actually a ColumnArray.

SELECT CAST(([1, 2, 3], ['1', '2', 'foo']), 'Map(UInt8, String)') AS map

void DataTypeMap::serializeTextJSON(const IColumn & column, size_t row_num, WriteBuffer & ostr, const FormatSettings & settings) const
{
writeChar('[', ostr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be more natural to serialize Map as a JSON object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

UInt8 type;
DB::readBinary(type, buf);

switch (type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using Field::dispatch + overloaded functions, to avoid enumerating all Field types in many places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an example, see writeFieldText or toString(Field) in Field.cpp.

@hexiaoting
Copy link
Contributor Author

hexiaoting commented Nov 3, 2020

@akuzm @CurtizJ @Enmk
Now in the latest implementation, Map's usage includes:

insert into table_map(m) select {'key1' : number, 'key2' : number * 2} from numbers(1000)
insert into table_map(m) select map('key1', number, 'key2', number * 2) from numbers(1000)

But this is repeated with query parameter.

From the test I found the syntax of DataType Map is confused with below test cases:

  1. 01526_param_uuid's test case is :
${CLICKHOUSE_CLIENT} --param_p1='ffffffff-ffff-ffff-ffff-ffffffffffff' --query "SELECT {p1:UUID}"
  1. 01342_query_parameters_alias:
$CLICKHOUSE_CLIENT --param_x '\N' --query 'SELECT {x:Nullable(Nothing)} as a' --format TSVWithNamesAndTypes

3)01056_prepared_statements_null_and_escaping.sh

${CLICKHOUSE_CURL} -sS "${CLICKHOUSE_URL}&param_x=Hello,%20World" \
    -d "SELECT {x:Nullable(String)}";
......

4)01015_insert_values_parametrized.sh

$CLICKHOUSE_CLIENT --input_format_values_deduce_templates_of_expressions=0 --input_format_values_interpret_expressions=1 --param_p_n="-1" --param_p_s="param" --param_p_a="[0.2,0.3]" --query="INSERT INTO insert_values_parametrized  VALUES \
(5 + {p_n:Int8}, lower(concat('Evaluate', {p_s:String})), arrayIntersect([0, 0.2, 0.6], {p_a:Array(Nullable(Float32))}))
  1. 00956_http_prepared_statements
${CLICKHOUSE_CURL} -sS "${CLICKHOUSE_URL}&param_id=1" \
    -d "SELECT * FROM ps WHERE i = {id:UInt8} ORDER BY i, s, d";
  1. 00955_complex_prepared_statements
$CLICKHOUSE_CLIENT --max_threads=1 --param_aui="[1, 2]" \
    -q "SELECT t FROM ps WHERE a = {aui:Array(UInt16)}";

7)00954_client_prepared_statements.sh

$CLICKHOUSE_CLIENT --max_threads=1 --param_phrase='Hello, world' \
    -q "SELECT * FROM ps WHERE s = {phrase:String}";

But

SELECT {p1:UUID}
SELECT {x:Nullable(Nothing)} as a
......

can also be parsed as Map .
How can I solve this problem?

@hexiaoting hexiaoting marked this pull request as draft November 6, 2020 09:19
@hexiaoting hexiaoting marked this pull request as ready for review November 10, 2020 03:26
@hexiaoting
Copy link
Contributor Author

@CurtizJ ready for review.

@hexiaoting
Copy link
Contributor Author

@alexey-milovidov @CurtizJ
hi, right now this PR is ready to review.
In this version, data in Map DataType is stored using two columns, one for storing keys and the other is for storing values. When using map['key'] to get the value I use sequential iterate method, Of course this is the slowest way。

After this PR is reviewed, in the next stage:

  1. @Enmk is trying to implement index based lookup method to speedup.
  2. consider using separate columns for each key to make the Map DataType directly columnar storage.

@CurtizJ
Copy link
Member

CurtizJ commented Dec 1, 2020

Thanks. I will review and merge soon.

@hexiaoting
Copy link
Contributor Author

This PR is moved to #17829

CurtizJ added a commit that referenced this pull request Dec 16, 2020
@CurtizJ CurtizJ merged commit 12604ce into ClickHouse:master Dec 16, 2020
@meshpaul
Copy link

🥇

@ahutsunshine
Copy link

@Enmk @CurtizJ @akuzm May I ask when and what's the plan to build a new version to release the feature?

@wangxin05
Copy link

@alexey-milovidov @CurtizJ hi, right now this PR is ready to review. In this version, data in Map DataType is stored using two columns, one for storing keys and the other is for storing values. When using map['key'] to get the value I use sequential iterate method, Of course this is the slowest way。

After this PR is reviewed, in the next stage:

  1. @Enmk is trying to implement index based lookup method to speedup.
  2. consider using separate columns for each key to make the Map DataType directly columnar storage.

Hi,
Thank you so much for providing this great data type. And by the way, what is the current progress of the second item(consider using separate columns for each key to make the Map DataType directly columnar storage.)?

@hexiaoting
Copy link
Contributor Author

@wangxin05 maybe this feature will be implemented in #23932

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants