[SPARK-28895][K8S] Support defining HADOOP_CONF_DIR and config map at the same time #25609
Conversation
…if the client process has files to upload
Test build #109859 has finished for PR 25609 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status success
Test build #109861 has finished for PR 25609 at commit
Test build #109862 has finished for PR 25609 at commit
// should download the configmap to our client side.
// 1. add configurations from k8s configmap to hadoopConf
conf.get(KUBERNETES_HADOOP_CONF_CONFIG_MAP).foreach { cm =>
  val hadoopConfFiles = client.configMaps().withName(cm).get().getData.asScala
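For context, a minimal sketch of what the client-side merge could look like once the configmap data is fetched; the fabric8 `KubernetesClient` calls mirror the diff above, while the helper name and the filtering on `.xml` entries are assumptions for illustration:

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

import scala.collection.JavaConverters._

import io.fabric8.kubernetes.client.KubernetesClient
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: fold the data entries of a named configmap
// (file name -> XML payload) into a client-side Hadoop Configuration.
def addConfigMapToHadoopConf(
    client: KubernetesClient,
    configMapName: String,
    hadoopConf: Configuration): Unit = {
  val data = client.configMaps().withName(configMapName).get().getData.asScala
  data.filter { case (name, _) => name.endsWith(".xml") }.foreach { case (name, xml) =>
    // addResource parses the payload the same way a file under HADOOP_CONF_DIR would be parsed.
    hadoopConf.addResource(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), name)
  }
}
```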
@yaooqinn This happens at submission time on the launcher machine; fetching the configmap from the cluster locally is odd and, in my opinion, not the way to go. You could just point to the right Hadoop config and spark-submit will pick it up. `spark.kubernetes.hadoop.configMapName` was meant to be used for the driver pod, so that the Hadoop files can be mounted on the fly within the cluster. Even if you launch cluster mode from inside the cluster you can do the same: mount the configmap and point to the files via the `HADOOP_CONF_DIR` var. @erikerlandson @dongjoon-hyun @ifilonenko FYI.
Btw, configmaps are namespaced.
- `spark.kubernetes.hadoop.configMapName` has no "driver" in its name, so I guess it is OK to be used across the whole Spark application, including the client process.
- Using `spark.kubernetes.hadoop.configMapName` and `HADOOP_CONF_DIR` is an either-or thing, check here. It is more that they both contain the same thing and are used by the client to define the driver pod, so they should be equal. For now, the only difference between them is that `HADOOP_CONF_DIR` is on the classpath of the client process and the other is not. With `spark.kubernetes.hadoop.configMapName`, if our application has no extra files (`--jars`/`--files`) to upload, it works, but once it does need one or more, it fails. I guess this behavior for `spark.kubernetes.hadoop.configMapName` is kind of unacceptable.
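To make the either-or point concrete, here is a rough sketch of the relaxed precedence being argued for; the helper and the way it reads the two sources are modeled on what a `HadoopConfDriverFeatureStep`-like step might do, not copied from the PR:

```scala
import org.apache.spark.SparkConf

// Illustrative only: both sources may be set; the pre-existing configmap wins for what
// gets mounted on the driver pod, while HADOOP_CONF_DIR still serves the client process.
def driverPodConfSource(conf: SparkConf): Option[String] = {
  val existingConfMap = conf.getOption("spark.kubernetes.hadoop.configMapName")
  val hadoopConfDir = sys.env.get("HADOOP_CONF_DIR")
  // A strict either-or would be: require(existingConfMap.isEmpty || hadoopConfDir.isEmpty).
  // Preferring the configmap removes that restriction.
  existingConfMap.orElse(hadoopConfDir)
}
```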
I see that they are either-or (the either-or restriction is meant for the cluster deployment, not this new feature), but I think if you specify them both (modify the code there to allow both of them to be defined instead of either-or and select according to the user's preference, or use the pod template feature to emulate the configmap mounting) it should work, since spark-submit is supposed to pick up the Hadoop credentials (e.g. `if ((clusterManager == MESOS || clusterManager == KUBERNETES)`) from `HADOOP_CONF_DIR`. I do find fetching the configmap from the cluster redundant, though. If you can download the configuration, you could just add it at spark-submit time on the client machine (it is not safer or anything if you fetch it, AFAIK).
@ifilonenko thoughts here? Do you think we can get the kerberos integration tests PR merged? That is the only viable way to make sure things are stable in the future.
If they are both specified, the main corner case seems to be the scenario where they aren't consistent, but that could be checked for.
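One way such a consistency check could look (a hypothetical helper, not part of the PR), comparing the local `HADOOP_CONF_DIR` files against the configmap entries:

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical guard for the corner case above: warn when the two sources diverge.
// The caller passes the configmap's data map (file name -> content).
def warnIfInconsistent(hadoopConfDir: String, configMapData: Map[String, String]): Unit = {
  val localFiles = new File(hadoopConfDir).listFiles().toSeq
    .filter(_.isFile)
    .map(f => f.getName -> new String(Files.readAllBytes(f.toPath), StandardCharsets.UTF_8))
    .toMap
  val common = localFiles.keySet intersect configMapData.keySet
  val differing = common.filter(name => localFiles(name) != configMapData(name))
  val onlyOnOneSide = (localFiles.keySet union configMapData.keySet) -- common
  if (differing.nonEmpty || onlyOnOneSide.nonEmpty) {
    System.err.println(
      s"HADOOP_CONF_DIR and the configmap are not consistent " +
        s"(differing: $differing, present in only one source: $onlyOnOneSide)")
  }
}
```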
I agree with allowing both of them to be set, which was my original idea too.
@skonto I'll make sure to get that PR resolved so that we can use it for testing, yeah.
// 2. add configurations from arguments or spark properties file to hadoopConf
SparkHadoopUtil.appendS3AndSparkHadoopConfigurations(conf, hadoopConf)
// 3. set or reset user group information
UserGroupInformation.setConfiguration(hadoopConf)
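As context for why the UGI reset matters, a small illustrative snippet for a kerberized HDFS; the principal and keytab are placeholders, and the login call is only one way a client might authenticate, not something taken from the PR:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Once the client-side hadoopConf carries the security settings (whether from
// HADOOP_CONF_DIR or from the fetched configmap), resetting UGI lets the upload
// step authenticate against a kerberized HDFS.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/etc/security/keytabs/user.keytab")
```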
Spark submit should take care of the security stuff.
Yes, I agree. But for now, spark-submit is quite YARN-specific. For the security settings contained in the k8s configMap, it does not work.
Test build #110165 has finished for PR 25609 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
gentle ping @skonto
is configured, the hadoop configurations will only be used by the Driver and its Executors. If your client process has
extra dependencies to upload to `spark.kubernetes.file.upload.path`, you may need to configure `HADOOP_CONF_DIR` too.
When these two variables are both set, Spark will prefer `spark.kubernetes.hadoop.configMapName` to be mounted on the
Driver/Executor pods.
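A hypothetical client-side setup matching this documented combination (hostnames, paths, image, and the configmap name are placeholders): `HADOOP_CONF_DIR` is visible to the launching process so it can reach the file system behind `spark.kubernetes.file.upload.path`, while the configmap is what actually gets mounted on the driver and executor pods.

```scala
import java.util.Collections

import org.apache.spark.launcher.SparkLauncher

// Launch in cluster mode with both sources configured; the local jar is uploaded
// via the client's Hadoop configuration, the pods use the mounted configmap.
val handle = new SparkLauncher(Collections.singletonMap("HADOOP_CONF_DIR", "/etc/hadoop/conf"))
  .setMaster("k8s://https://kubernetes.example.com:6443")
  .setDeployMode("cluster")
  .setConf("spark.kubernetes.container.image", "spark:latest")
  .setConf("spark.kubernetes.hadoop.configMapName", "hadoop-conf")
  .setConf("spark.kubernetes.file.upload.path", "hdfs:///tmp/spark-uploads")
  .setMainClass("com.example.MyApp")
  .setAppResource("/opt/jobs/my-app.jar")
  .startApplication()
```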
The concept looks good to me, @ifilonenko any corner cases?
I think it is still okay to warn the user that the configuration is being picked up from the configMap, in spite of the fact that HADOOP_CONF_DIR is defined.
...e/src/test/scala/org/apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStepSuite.scala
@erikerlandson LGTM; we need an integration test for this. @ifilonenko gentle ping.
Test build #111938 has finished for PR 25609 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Anything blocking this? @skonto
retest this please
cc @cloud-fan @vanzin, could any active committers help review this? Thanks in advance.
Test build #115066 has finished for PR 25609 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
retest this please
Test build #115069 has finished for PR 25609 at commit
Kubernetes integration test starting
Test build #115072 has finished for PR 25609 at commit
Kubernetes integration test status success
Does this have an integration test now?
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #122767 has finished for PR 25609 at commit
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
retest this please
Test build #127901 has finished for PR 25609 at commit
Kubernetes integration test starting
Kubernetes integration test status success
What changes were proposed in this pull request?
Changes in this pull request allow users to define `HADOOP_CONF_DIR` and `spark.kubernetes.hadoop.configMapName` at the same time. When both of them are defined, Spark will give precedence to the config map for what is mounted on the driver pod. This enables the Spark client process to communicate with any Hadoop cluster if it needs to.
Why are the changes needed?
The BasicDriverFeatureStep for Spark on Kubernetes uploads the files/jars specified by `--files`/`--jars` to a Hadoop-compatible file system configured by `spark.kubernetes.file.upload.path`. When using `HADOOP_CONF_DIR`, the spark-submit process can recognize that file system, but `spark.kubernetes.hadoop.configMapName` is only mounted on the pods and is not applied back to the client process.
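Roughly what the upload step needs from the client-side configuration (a simplified sketch with placeholder paths, not the actual BasicDriverFeatureStep code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Resolving spark.kubernetes.file.upload.path requires a FileSystem built from the
// client-side Hadoop configuration; if that configuration is effectively empty
// (nothing from HADOOP_CONF_DIR, nothing from the configmap), the client may be
// unable to reach or authenticate to the target file system, which is the failure
// described above.
val hadoopConf = new Configuration()
val uploadDir = new Path("hdfs://namenode.example.com:8020/tmp/spark-upload")
val fs = uploadDir.getFileSystem(hadoopConf)
fs.copyFromLocalFile(new Path("/local/path/extra.jar"), uploadDir)
```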
Other related spark configurations
Does this PR introduce any user-facing change?
This PR will now fetch the Hadoop config map from the k8s cluster to the local client process if there are files to upload to Hadoop.
How was this patch tested?
Manually tested with Spark + a k8s cluster + a standalone kerberized HDFS cluster.
Added a unit test.