
Core: Data loss after compaction #2195 #2196


Merged
merged 6 commits into apache:master on Feb 4, 2021

Conversation

Stephen-Robin
Contributor

For details, please refer to #2195

@github-actions github-actions bot added the core label Feb 2, 2021
@rdblue
Contributor

rdblue commented Feb 2, 2021

@aokolnychyi, can you take a look at this?

@rdblue rdblue requested a review from aokolnychyi February 2, 2021 23:54
@zhangjun0x01
Contributor

I am sorry, this is a known bug. I found it while working on the Rewrite action and opened PR #1762, which just hasn't been merged. The purpose of the rewrite action is to compact small files, so I think it is more reasonable to exclude data files whose size is greater than the target size during the table scan.

@rdblue
Contributor

rdblue commented Feb 3, 2021

@zhangjun0x01, next time please highlight that there is a correctness problem. I didn't know that #1762 fixed a correctness problem or we would have prioritized it and made sure it was in the 0.11.0 release. Thanks for pointing us to that issue, we will take a look at both alternative solutions.

@Stephen-Robin
Contributor Author

Stephen-Robin commented Feb 3, 2021

I am sorry, this is a known bug. I found it while working on the Rewrite action and opened PR #1762, which just hasn't been merged. The purpose of the rewrite action is to compact small files, so I think it is more reasonable to exclude data files whose size is greater than the target size during the table scan.

@zhangjun0x01
Hi zhangjun, I saw that large files exceeding the threshold are filtered out in PR #1762. Perhaps the rewrite data files operation should not only merge small files, but also split and rewrite large files. This PR already splits large files and rewrites them. What do you think? And thanks, rdblue, for pushing this issue forward.

@zhangjun0x01
Contributor

@Stephen-Robin I think there is no need to split the large file, because if the file size exceeds the target size it will be automatically split into multiple CombinedScanTasks when reading and read concurrently, rather than having a single task read the whole large file. Splitting a large file into multiple small files will make the Rewrite action consume more resources, and too many small files are not friendly to HDFS. What do you think?
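
For reference, a minimal sketch of the split-and-combine read planning described above, assuming an existing Iceberg Table handle named table (the 256 MB target and the printing are illustrative, not part of this PR):

import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.io.CloseableIterable;

// planTasks() slices files larger than the split target into offset ranges and
// bin-packs the slices into CombinedScanTasks that readers process in parallel,
// so a large file is already read concurrently without being rewritten.
CloseableIterable<CombinedScanTask> combinedTasks = table.newScan()
    .option(TableProperties.SPLIT_SIZE, String.valueOf(256L * 1024 * 1024))
    .planTasks();
for (CombinedScanTask combined : combinedTasks) {
  for (FileScanTask slice : combined.files()) {
    System.out.println(slice.file().path() + " offset=" + slice.start() + " length=" + slice.length());
  }
}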

@zhangjun0x01
Contributor

@zhangjun0x01, next time please highlight that there is a correctness problem. I didn't know that #1762 fixed a correctness problem or we would have prioritized it and made sure it was in the 0.11.0 release. Thanks for pointing us to that issue, we will take a look at both alternative solutions.

I'm very sorry; this problem should have been resolved before the 0.11.0 release. When I submitted that PR I asked openinx to help me review it, but he may have been busy, and I forgot about the PR when 0.11.0 was released. I will pay more attention to this next time.

@HeartSaVioR
Contributor

You're welcome to subscribe to the dev@ mailing list and participate in discussions, RC verification, etc. A GitHub mention can easily slip through, so when you find something urgent like a regression or a correctness issue, the dev@ mailing list (or the Slack channel) is the appropriate place to share it.

@Stephen-Robin
Contributor Author

Stephen-Robin commented Feb 3, 2021

@Stephen-Robin Splitting a large file into multiple small files will make the Rewrite action consume more resources, and too many small files are not friendly to HDFS. What do you think?

@zhangjun0x01
I think it makes sense to split large files into smaller files.

  1. The user's goal is to get the expected file size.
  2. When the split target size is set to something like 256 MB, rewriting large files after splitting will not significantly increase the load on HDFS.
  3. After splitting into smaller files, it is easier to filter by file metadata in the manifest.
  4. We could also add an option to let users decide whether to split large files.

@zhangjun0x01
Contributor

  2. When the split target size is set to something like 256 MB, rewriting large files after splitting will not significantly increase the load on HDFS.

Yes, I was wrong about that.

  3. After splitting into smaller files, it is easier to filter by file metadata in the manifest.

That makes sense.

@RussellSpitzer
Member

I think this PR is pretty much ready to go other than a few nits

@RussellSpitzer
Member

RussellSpitzer commented Feb 3, 2021

My quick notes on this issue:

Previously, when computing the rewrite tasks for RewriteDataFiles, the code would ignore scan tasks that referred to a single file. This is an issue because large files can potentially be split into multiple read tasks. If one slice of a large file was combined with a slice from another file, that slice would be rewritten with the other file, but the remaining slices would be ignored.

For example given 2 files
File A - 100 Bytes
File B - 10 Bytes

If the target split size was 60 bytes we would end up with 3 tasks
A : 1 - 60
A : 61 - 100
B : 0 - 10

Which would be combined into

(A : 1 - 60)
(A : 61 -100, B : 0 -10)

The first task would be discarded since it only referred to one file. The second task would be rewritten, which would end with deleting files A and B.

I believe the original intent was to ignore single-file scan tasks because it was assumed these would be unchanged files. But if a single-file scan task only contains a partial scan of a file, it must be rewritten, since it represents a new, smaller file that needs to be written.

Normally this doesn't cause data loss, since an ignored file won't be deleted. But if a split is combined with another file, that triggers the deletion of the large file even though several splits of the large file will not have been written into new files.
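
To make the failure mode concrete, here is a small self-contained simulation of the planning logic described above (this is not Iceberg code; the splitting, bin-packing, and single-file filter are simplified stand-ins and all names are illustrative):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RewritePlanningSketch {
  record Split(String file, long start, long end) {}

  public static void main(String[] args) {
    Map<String, Long> files = new LinkedHashMap<>();
    files.put("A", 100L); // File A - 100 bytes
    files.put("B", 10L);  // File B - 10 bytes
    long targetSplitSize = 60L;

    // Split each file into read tasks no larger than the target split size.
    List<Split> splits = new ArrayList<>();
    for (Map.Entry<String, Long> e : files.entrySet()) {
      for (long start = 0; start < e.getValue(); start += targetSplitSize) {
        splits.add(new Split(e.getKey(), start, Math.min(start + targetSplitSize, e.getValue())));
      }
    }
    // splits: A[0,60), A[60,100), B[0,10)

    // Greedily combine splits until each group reaches the target size.
    List<List<Split>> groups = new ArrayList<>();
    List<Split> current = new ArrayList<>();
    long currentBytes = 0;
    for (Split s : splits) {
      current.add(s);
      currentBytes += s.end() - s.start();
      if (currentBytes >= targetSplitSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    // groups: [A[0,60)] and [A[60,100), B[0,10)]

    // The buggy filter: skip any group that touches only a single file.
    Set<String> deleted = new TreeSet<>();
    for (List<Split> group : groups) {
      Set<String> touched = new TreeSet<>();
      for (Split s : group) {
        touched.add(s.file());
      }
      if (touched.size() == 1) {
        continue; // group [A[0,60)] is skipped, so those bytes of A are never rewritten
      }
      deleted.addAll(touched); // rewriting [A[60,100), B[0,10)] deletes all of A and B
    }
    System.out.println("deleted: " + deleted + ", but A[0,60) was never written to a new file");
  }
}

Running it reports both A and B as deleted while the first 60 bytes of A were never rewritten into a new file, which is exactly the data loss reported in #2195.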

@RussellSpitzer
Member

lgtm, @aokolnychyi do you have any comments?


Actions actions = Actions.forTable(table);

long targetSizeInBytes = file.length() - 10;
Contributor

It would be easier to understand if this were set up where file is used to create dataFile. Or, this could use the length of dataFile instead so we don't need to make sure dataFile and file have the same length.

Contributor Author

I don't quite follow this; I need to set splitTargetSize to a value smaller than the largest file.


CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
Assert.assertEquals("Should have 3 scan tasks before rewrite", 3, dataFiles.size());
Contributor

Minor: I think we should refer to these in the context as "files" not "tasks" because tasks are usually what we get after splitting and combining.

Member
@RussellSpitzer RussellSpitzer Feb 3, 2021

Ah yeah, I think I was mistaken here. We may be generating multiple files via the writeRecords method, which could potentially parallelize the write, resulting in multiple files. (This is for the 2-record file.)

Instead we would need to do a repartition(1) to ensure a single file is written.

In my internal test I just did the repartition to make sure we had single-file writes.
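
A minimal sketch of that approach (assuming a Dataset<Row> named df holding the test records and a String tableLocation pointing at the Hadoop-path table; both names are illustrative):

// Coalesce to a single partition so the append produces exactly one data file
// instead of one file per Spark write task.
df.repartition(1)
    .write()
    .format("iceberg")
    .mode("append")
    .save(tableLocation);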

Contributor Author

done

fileAppender.add(record);
excepted.add(record);
}
}
Contributor

Why does this create an appender instead of using writeRecords?

I think this could easily find the largest data file from planFiles and base the length on that instead.

Member

As I noted above, you may need a modified "writeRecords" which uses a single partition if you want to generate 1 file.

Contributor Author

Okay, thank you for your comments, I will make changes immediately

Contributor Author

done

Assert.assertEquals("Action should add 2 data files", 2, result.addedDataFiles().size());

long postRewriteNumRecords = spark.read().format("iceberg").load(tableLocation).count();
List<Object[]> rewrittenRecords = sql("SELECT * from rows sort by c2");
Contributor
@rdblue rdblue Feb 3, 2021

I'm not very comfortable with using the same view to load original and rewritten records. Can you create a separate view for the rewritten data? That way we avoid any weird caching behavior.

Contributor Author

done


CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
Assert.assertEquals("Should have 3 scan tasks before rewrite", 3, dataFiles.size());
Contributor

This appends one file directly (the big one, dataFile) and one using writeRecords. Why are there 3 files in the table at this point?

Contributor Author
@Stephen-Robin Stephen-Robin Feb 3, 2021

Yes: 1 maxSizeFile + 2 row data files.

.splitOpenFileCost(1)
.execute();

Assert.assertEquals("Action should delete 4 data files", 4, result.deletedDataFiles().size());
Contributor

What are the 4 data files that this action deletes? Is one of the 3 from above duplicated?

Contributor Author

Yeah, I'm wondering if I need to perform deduplication.

Member

I wonder if this is an error in the RewriteDataFiles action, where it should use a set to collect the deleted files. It looks like we are getting one "delete" record for each split of the large file.
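
For reference, a rough sketch of that deduplication idea (not the actual RewriteDataFiles code; tasksToRewrite is an assumed name for the file scan tasks selected for rewriting):

import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;

// Key deleted files by path so a large file that was split into several
// scan tasks is counted (and deleted) only once.
Map<String, DataFile> deletedByPath = new LinkedHashMap<>();
for (FileScanTask task : tasksToRewrite) {
  deletedByPath.putIfAbsent(task.file().path().toString(), task.file());
}
Collection<DataFile> deletedDataFiles = deletedByPath.values();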

Contributor

Let's fix this one later.

@rdblue rdblue added this to the Java 0.11.1 Release milestone Feb 4, 2021

CloseableIterable<FileScanTask> tasks = table.newScan().planFiles();
List<DataFile> dataFiles = Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
DataFile maxSizeFile = Collections.max(dataFiles, Comparator.comparingLong(DataFile::fileSizeInBytes));
Contributor

This looks good.

@rdblue rdblue merged commit e2a0ba4 into apache:master Feb 4, 2021
@rdblue
Contributor

rdblue commented Feb 4, 2021

Thank you for fixing this, @Stephen-Robin! I've tagged this for inclusion in the 0.11.1 patch release. We should probably do that soon since we have a correctness bug.
