Adds AddFiles Procedure #2210
Conversation
This procedure mimics our old support for "IMPORT DATA" but does not allow dynamic overwriting of files in partitions. Overwriting a partition will now require a separate DELETE command to remove it first. Other than that, the capabilities are identical to the previous functionality, except that it is now a procedure rather than a SQL command.
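The two-step partition replacement described above can be sketched as a pair of SQL statements. This is an illustrative sketch only: the catalog, table, and argument names below are assumptions for the example, not a verified reference for the procedure's final signature.

```java
// Sketch of the two-step workflow: a DELETE to clear the partition,
// then an add_files call to register the new files. All identifiers
// here are placeholders for illustration.
public class AddFilesWorkflow {
    // Step 1: IMPORT DATA used to overwrite a partition in place;
    // the procedure requires clearing it with an explicit DELETE first.
    static String deletePartition(String table, String column, String value) {
        return String.format("DELETE FROM %s WHERE %s = '%s'", table, column, value);
    }

    // Step 2: register the new files via the add_files procedure.
    static String addFiles(String catalog, String table, String sourceTable) {
        return String.format(
            "CALL %s.system.add_files(table => '%s', source_table => '%s')",
            catalog, table, sourceTable);
    }

    public static void main(String[] args) {
        System.out.println(deletePartition("db.events", "day", "2021-02-01"));
        System.out.println(addFiles("spark_catalog", "db.events", "parquet.`/data/events/day=2021-02-01`"));
    }
}
```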
Implementation of #2068
Thanks for working on this, @RussellSpitzer! Let me take a look today.
    }
    validatePartitionSpec(table, dataPath, fs, partition);

    if (table.properties().get(TableProperties.DEFAULT_NAME_MAPPING) == null) {
I am +1 on this too. I think it will be safer.
I am looking for a way to migrate a native Hive table to a Hive table backed by an Iceberg table, and during my search I found this PR (along with #2068 - Procedure for adding files to a Table), but I have also found the same thing for Flink: #2217 (Flink: migrate hive table to iceberg table). Maybe we should factor out the common parts to a place which is accessible for both Spark and Flink (and for Hive as well), like a common Java API. Would this be possible?
@pvary we also have the migrate and snapshot actions
      fs = dataPath.getFileSystem(conf);
      isFile = fs.getFileStatus(dataPath).isFile();
    } catch (IOException e) {
      throw new RuntimeException("Unable to access add_file path", e);
Maybe UncheckedIOException makes more sense. Or maybe just replace with Util.getFs?
I need to go back through the PR and switch these to unchecked IOExceptions. This is a good suggestion; I forgot to do it in the midst of refactoring.
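The suggestion above can be illustrated with a minimal, self-contained sketch: wrapping the checked IOException in java.io.UncheckedIOException rather than a plain RuntimeException keeps the IO failure catchable by type while staying unchecked. The method and path names here are placeholders, not the PR's actual code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch only: wrap a checked IOException in
// UncheckedIOException so callers can still catch IO failures
// specifically, without a throws clause on the method.
public class UncheckedIoExample {
    static String readStatus(String path) {
        try {
            if (path.isEmpty()) {
                throw new IOException("Unable to access add_file path: " + path);
            }
            return "ok";
        } catch (IOException e) {
            // Preserves the original IOException as the cause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readStatus("/tmp/data"));
    }
}
```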
          table.name(), partition.size(), partitionFields.size()));
    }
    partitionFields.forEach(field -> {
      if (!partition.containsKey(field.name())) {
From the above discussion, I gather that Hive partitions can also be imported? Do we see issues here around Hive's lowercasing of columns? If yes, then something to consider for future enhancements...
Yeah, this could be a problem if the entries that Spark returns differ from those we expect based on column names. I'll need to add some Hive tests for that.
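The lowercasing concern above can be sketched with a small, self-contained example: Hive reports partition columns in lower case, so an exact containsKey check against mixed-case field names can miss matches, and a case-insensitive map is one defensive option. This is an illustration of the pitfall, not the validation code the PR actually uses.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: a case-insensitive view of the partition
// map so that "eventdate" (as reported by Hive) still matches the
// mixed-case field name "eventDate".
public class PartitionKeyLookup {
    static Map<String, String> caseInsensitive(Map<String, String> partition) {
        Map<String, String> result = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        result.putAll(partition);
        return result;
    }

    public static void main(String[] args) {
        // Hive hands back a lowercased column name.
        Map<String, String> fromHive = Map.of("eventdate", "2021-02-01");
        // An exact lookup would fail; the case-insensitive one succeeds.
        System.out.println(fromHive.containsKey("eventDate"));                  // false
        System.out.println(caseInsensitive(fromHive).containsKey("eventDate")); // true
    }
}
```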
Previously we always just used the Spark in-memory file index to list the partitions in a table, regardless of whether it was a pure file-based table or a Hive table. Now we have two distinct paths: either a file-based table, identified like `parquet.path`, or a Hive table with the standard `database.table` identifier. When a table identifier is given for Hive, we attempt to read the catalog's partition listing, which lets us discover alternate partition locations and match the underlying Hive catalog.
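The dispatch between the two identifier forms described above can be sketched as follows. The format prefixes and the parsing are illustrative assumptions for the example, not the exact resolution logic in the PR.

```java
import java.util.Set;

// Illustrative sketch only: a source identifier prefixed with a file
// format (e.g. "parquet.`/path`") is treated as a file-based table and
// listed from the filesystem, while anything else is resolved as
// "database.table" through the Hive catalog.
public class SourceIdentifier {
    private static final Set<String> FILE_FORMATS = Set.of("parquet", "orc", "avro");

    static boolean isFileBased(String ident) {
        int dot = ident.indexOf('.');
        return dot > 0 && FILE_FORMATS.contains(ident.substring(0, dot).toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isFileBased("parquet.`/data/events`")); // file-based listing
        System.out.println(isFileBased("db.events"));              // Hive catalog listing
    }
}
```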
Change function calls and some error messages
There seems to be a classpath issue with ORC and Spark, not Iceberg related. I'll check this out tomorrow.
I'm having issues with writing to ORC tables from Spark in our tests, probably a version conflict?
We have some strange classpath issues with ORC in our test classpath. To make things worse, when I attempt to debug, the debug console does not have the same issue and can call the missing method without issue.
LGTM on the Hive-related stuff. Only minor comments, and mostly questions.
I left a few nits and I'd love to see a couple of Avro tests. Other than that, looks ready to go.
Thanks for the great work, @RussellSpitzer!
    }

    @Ignore // Classpath issues prevent us from actually writing to a Spark ORC table
Can we add a couple of tests for Avro?
Sure
I have a few nits/questions but should be ready to go otherwise.
LGTM. I think we need to revert the no-longer-needed change in the style file and then it should be good to go. There were a few super minor nits, but those are optional.
Thanks for the hard work, @RussellSpitzer! I am sure this will be really useful.
Thank you, @RussellSpitzer! Thanks everyone who participated in the review!
@RussellSpitzer, could you please create issues for ORC and Avro tests? We already have an issue to not depend on the Spark in-memory file index.
Thanks everybody! I think the review period really helped make this code better. I hope it's of use to folks in the future.
@RussellSpitzer hello, how do I associate an existing Iceberg table with a Hive table? What is the Spark SQL syntax? Can you give an example? Thanks.
I'm not sure what you are asking. I'll be adding docs shortly, but you should probably ask your question, with more detail about what you are trying to accomplish, on the mailing list or the Slack, since this may not do what you expect it to do.
Hi, I'm wondering if it's possible to add a specific file (or even an unpartitioned Parquet/ORC table) to a specific partition of a partitioned Iceberg table?
@talgos1, have you tried using a full file path as the source table?
@rdblue Since my last comment, I succeeded doing that using the Spark and Java APIs and explicitly defining a synthetic SparkPartition for the file/path:

    // Define the source file
    val uri = "/some/path/to/orc/file.orc"
    val format = "orc"
    val partitionSpec: util.Map[String, String] = Map("some_partition_key" -> "some_partition_value").asJava

    // Define a synthetic spark partition
    val sparkPartition = new SparkPartition(partitionSpec, uri, format)

    // Do the add call for importing partitioned source table
    SparkTableUtil.importSparkPartitions(spark, Seq(sparkPartition).asJava, table, spec, stagingDir)

WDYT?
I'm glad you were able to find a way to get it working! We may want to update this so that we can detect when there's only a single file and handle that case. But in the meantime, it looks like this is a good way to get it working using the API directly.
Is there a way to import a Hive table into a specific partition in Iceberg? The partition schema of the Hive table and the Iceberg table differ: for example, the Hive table has two partition columns but the Iceberg table has three, including a bucket(id) partition. We want to import this Hive table into a specific partition in Iceberg. When we add files into Iceberg, we get an error: "because that table is partitioned and contains non-identity partition transforms which will not be compatible. Found non-identity fields [1000: bucketid: bucket2]"