Feature: support deduplication on stage attachment api #11710

Closed
@ZhiHanZ

Summary
To ensure data ingestion idempotency, Databend already supports deduplicating DML statements through a deduplication label:
https://databend.rs/doc/sql-commands/setting-cmds/set-var

For cross-language driver integration, we could add a REST API field for the label.

akoshchiy commented on Jun 14, 2023

@ZhiHanZ Hi! Can I try to fix this? As I understand it, we should extend HttpQueryRequest with a new field and then pass it to the QueryContext settings when the HTTP query is created.
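
As a rough illustration of the approach proposed in this comment, here is a minimal sketch. The struct definitions are simplified, hypothetical stand-ins (the real `HttpQueryRequest` and query settings in Databend are more involved), and the field name `deduplicate_label`, the `QuerySettings` type, and `apply_request_settings` are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the HTTP query request body.
#[derive(Debug, Default)]
struct HttpQueryRequest {
    sql: String,
    /// Hypothetical new field carrying the deduplication label.
    deduplicate_label: Option<String>,
}

/// Hypothetical stand-in for the per-query settings held by the query context.
#[derive(Debug, Default)]
struct QuerySettings {
    values: HashMap<String, String>,
}

impl QuerySettings {
    fn set(&mut self, key: &str, value: &str) {
        self.values.insert(key.to_string(), value.to_string());
    }

    fn get(&self, key: &str) -> Option<&str> {
        self.values.get(key).map(String::as_str)
    }
}

/// Copy the label from the request into the settings when the query is created.
/// Requests without a label leave the settings untouched.
fn apply_request_settings(req: &HttpQueryRequest, settings: &mut QuerySettings) {
    if let Some(label) = req.deduplicate_label.as_deref() {
        // The setting key is assumed to match the label name used with SET_VAR.
        settings.set("deduplicate_label", label);
    }
}
```

Making the field optional keeps existing clients working, since a request that omits it behaves exactly as before.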

ZhiHanZ commented on Jun 15, 2023

That is perfect. I think we do not need to add an additional field to the HttpRequest; we could use the QueryID header https://github.com/datafuselabs/databend/blob/7dd2a992338f25bf3ce883c8f36f77cd79d5c74f/src/query/service/src/servers/http/v1/http_query_handlers.rs#L41 for deduplication, as mentioned in the previous issue #11591.
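
A minimal sketch of reading the query ID from the request headers, assuming headers are available as plain name/value string pairs. The helper `query_id_from_headers` is hypothetical; the actual handler would use its HTTP framework's header API. Only the header name comes from this issue.

```rust
/// Header name taken from the curl example in this issue.
const QUERY_ID_HEADER: &str = "X-DATABEND-QUERY-ID";

/// Return the trimmed query ID from a list of (name, value) header pairs,
/// or `None` when the header is absent or its value is blank.
/// Header names are compared case-insensitively.
fn query_id_from_headers<'a>(headers: &[(&'a str, &'a str)]) -> Option<&'a str> {
    headers
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(QUERY_ID_HEADER))
        .map(|&(_, value)| value.trim())
        .filter(|value| !value.is_empty())
}
```

Trimming matters here because the curl example sends the value with extra leading spaces (`X-DATABEND-QUERY-ID:  insert1`).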

Expected Behavior:

CREATE TABLE sample
(
    Id      INT,
    City    VARCHAR,
    Score   INT
);

Contents of sample.csv (staged at @s1/sample.csv):

1,'Los Angeles',100
2,'Irvine',80
3,'San Diego',60
4,'Palo alto',70
5,'San Jose',55
6,'Milipitas',99

First request, with query ID insert1:

curl -s -u root: -XPOST "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/query" --header 'Content-Type: application/json'  --header 'X-DATABEND-QUERY-ID:  insert1' -d '{"sql": "insert into sample (Id, City, Score) values (?,?,?)", "stage_attachment": {"location": "@s1/sample.csv", "copy_options": {"purge": "true"}}}' | jq -r '.stats.scan_progress.bytes, .error'

Table contents after the first request:

1,'Los Angeles',100
2,'Irvine',80
3,'San Diego',60
4,'Palo alto',70
5,'San Jose',55
6,'Milipitas',99

Second request, reusing query ID insert1:

curl -s -u root: -XPOST "http://localhost:${QUERY_HTTP_HANDLER_PORT}/v1/query" --header 'Content-Type: application/json'  --header 'X-DATABEND-QUERY-ID:  insert1' -d '{"sql": "insert into sample (Id, City, Score) values (?,?,?)", "stage_attachment": {"location": "@s1/sample.csv", "copy_options": {"purge": "true"}}}' | jq -r '.stats.scan_progress.bytes, .error'

No more rows are inserted, because the request is deduplicated based on query ID insert1. The table is unchanged:
1,'Los Angeles',100
2,'Irvine',80
3,'San Diego',60
4,'Palo alto',70
5,'San Jose',55
6,'Milipitas',99
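
The expected behavior above can be modeled with a small in-memory sketch. The `Table` type and `insert_with_label` method are hypothetical and exist only to illustrate the idempotency rule; details such as where labels are stored, how long they are kept, and what happens when an insert fails are deliberately left out.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory model: a table plus the set of labels already applied.
#[derive(Default)]
struct Table {
    rows: Vec<(i32, String, i32)>,
    applied_labels: HashSet<String>,
}

impl Table {
    /// Insert `rows` unless `label` was already applied.
    /// Returns the number of rows actually inserted.
    fn insert_with_label(&mut self, label: &str, rows: &[(i32, &str, i32)]) -> usize {
        // `HashSet::insert` returns false when the value was already present,
        // which means this label has been seen and the insert must be skipped.
        if !self.applied_labels.insert(label.to_string()) {
            return 0;
        }
        self.rows
            .extend(rows.iter().map(|&(id, city, score)| (id, city.to_string(), score)));
        rows.len()
    }
}
```

Calling `insert_with_label("insert1", ...)` twice with the six sample rows inserts them once; the second call is a no-op, which mirrors the two curl requests.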

Title changed from "Feature: support to bring deduplication label on stage attachment api" to "Feature: support deduplication on stage attachment api" on Jun 15, 2023

akoshchiy commented on Jun 15, 2023

Does it mean, that we also should use provided X-DATABEND-QUERY-ID as query_id instead of generating it?

ZhiHanZ commented on Jun 16, 2023

> Does it mean, that we also should use provided X-DATABEND-QUERY-ID as query_id instead of generating it?

Exactly.
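
The rule agreed on here (use the client-supplied ID when present, otherwise generate one) could be sketched as follows. Both function names are hypothetical, and the counter-based generator is only a placeholder so the example needs no external crates; a real server would more likely generate a UUID.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide counter backing the placeholder ID generator.
static NEXT_ID: AtomicU64 = AtomicU64::new(1);

/// Hypothetical fallback generator used when the client sends no query ID.
fn generate_query_id() -> String {
    format!("generated-{}", NEXT_ID.fetch_add(1, Ordering::Relaxed))
}

/// Prefer the client-supplied ID; generate one only when none was sent
/// or the supplied value is blank.
fn resolve_query_id(provided: Option<&str>) -> String {
    match provided.map(str::trim) {
        Some(id) if !id.is_empty() => id.to_string(),
        _ => generate_query_id(),
    }
}
```

With this rule, retrying a request with the same `X-DATABEND-QUERY-ID` yields the same query ID on the server, which is what makes ID-based deduplication possible.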
