Description
I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3-compatible) for storing training data and eval data, and for passing data between steps in Argo workflows:
"pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"
The issue with s3:// protocol support appears to be specific to the TFDV/Apache Beam step: Beam's Python SDK does not provide an S3 filesystem implementation. As a workaround we are currently looking at changing the TFDV step to use local/attached storage (see the sketch below).
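A minimal sketch of that workaround, assuming boto3 is available in the step image and reusing the Minio parameters from the notebook (the bucket/key are the ones from the error below): stage the CSV from Minio onto the container's local disk, then point TFDV at the local path so Beam only ever resolves a local filesystem path.

    import boto3
    import tensorflow_data_validation as tfdv

    # Minio speaks the S3 API, so boto3 works once endpoint_url points at it.
    s3 = boto3.client(
        's3',
        endpoint_url='https://{}'.format(S3_ENDPOINT),
        aws_access_key_id=S3_ACCESS_KEY,
        aws_secret_access_key=S3_SECRET_KEY,
    )

    # Stage the training data onto local/attached storage.
    local_csv = '/tmp/train.csv'
    s3.download_file('ml-pipeline-playground',
                     'tfx/taxi-cab-classification/train.csv',
                     local_csv)

    # TFDV (via Beam's LocalFileSystem) handles plain local paths fine.
    stats = tfdv.generate_statistics_from_csv(data_location=local_csv)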
Minio access parameters seem to be properly configured: the validation step successfully creates several folders in the Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation
The error occurs on reading or writing any file in the Minio buckets, and it comes from the TensorFlow/Beam call tfdv.generate_statistics_from_csv():
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv
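The ValueError comes from Beam's filesystem registry: get_filesystem() matches the path's scheme against the FileSystem subclasses registered in the SDK, which at this point covers only a handful of schemes (local file, gs://, hdfs://, depending on the build) but nothing for s3://. A quick way to confirm (a sketch; get_all_subclasses() is the same registry hook filesystems.py uses internally):

    from apache_beam.io.filesystem import FileSystem
    # Importing the facade registers the filesystem implementations
    # bundled with this Beam build.
    import apache_beam.io.filesystems  # noqa: F401

    # Expect no 's3' entry in the output on this Beam version.
    print(sorted(fs.scheme() for fs in FileSystem.get_all_subclasses()))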
Minio files are accessed via the s3:// protocol, for example:
OUTPUT_DIR = 's3://demo04kubeflow/output'
This same step worked fine when train.csv was stored in a GCS bucket (the Beam Python SDK ships a GcsFileSystem registered for the gs:// scheme, so that path resolves):
gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv
Minio credentials were provided as environment variables to the ContainerOp:
return dsl.ContainerOp(
    name=step_name,
    image=DATAFLOW_TFDV_IMAGE,
    arguments=[
        '--csv-data-for-inference', inference_data,
        '--csv-data-to-validate', validation_data,
        '--column-names', column_names,
        '--key-columns', key_columns,
        '--project', project,
        '--mode', mode,
        '--output', validation_output,
    ],
    file_outputs={
        'schema': '/schema.txt',
    },
).add_env_variable(
    k8sc.V1EnvVar(name='S3_ENDPOINT', value=S3_ENDPOINT)
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_ENDPOINT_URL', value='https://{}'.format(S3_ENDPOINT))
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_ACCESS_KEY_ID', value=S3_ACCESS_KEY)
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_SECRET_ACCESS_KEY', value=S3_SECRET_KEY)
).add_env_variable(
    k8sc.V1EnvVar(name='AWS_REGION', value='us-east-1')
).add_env_variable(
    k8sc.V1EnvVar(name='BUCKET_NAME', value='demo04kubeflow')
).add_env_variable(
    k8sc.V1EnvVar(name='S3_USE_HTTPS', value='1')
).add_env_variable(
    k8sc.V1EnvVar(name='S3_VERIFY_SSL', value='1')
)
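Note that S3_ENDPOINT, S3_USE_HTTPS, S3_VERIFY_SSL, and the AWS_* variables are honoured by TensorFlow's S3 filesystem plugin, not by the Beam Python SDK, which would explain why the folder creation in Minio succeeds while the Beam-driven read fails. A quick in-container check (a sketch, assuming a TF 1.x image built with S3 support):

    import tensorflow as tf

    # TensorFlow's S3 filesystem reads the S3_ENDPOINT/AWS_* env vars,
    # so this can succeed even though Beam cannot resolve the same path.
    print(tf.gfile.Exists('s3://demo04kubeflow/output'))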
This pipeline example was created from a Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see attached the Jupyter notebook and two log files from the pipeline execution (validate step). All required files (such as train.csv) were uploaded to Minio from the notebook.
PipelineTFX4.ipynb.txt
tfx-taxi-cab-classification-pipeline-example-wait.log
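For reference, a minimal sketch of the notebook-side upload, assuming the minio Python client (bucket/object names as used above):

    from minio import Minio

    client = Minio(S3_ENDPOINT,            # host:port of the Minio service
                   access_key=S3_ACCESS_KEY,
                   secret_key=S3_SECRET_KEY,
                   secure=True)            # matches S3_USE_HTTPS=1

    # Upload the training data that the TFDV step will try to read.
    client.fput_object('ml-pipeline-playground',
                       'tfx/taxi-cab-classification/train.csv',
                       'train.csv')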
Changed the title: "Use Minio in Pipeline examples for reading training data and artifact storage" → "S3 errors in Pipeline examples for reading training data and artifact storage"
aronchick commented on Jan 9, 2019
/cc @jlewi can you page in the right folks here? we're blocked on using this until it's solved.
jlewi commented on Jan 10, 2019
Ack, I'll loop in some folks; but it sounds like the issue is actually outside Kubeflow and is in Apache Beam.
You might want to repost the issue in the Apache Beam tracker:
https://issues.apache.org/jira/browse/BEAM-2500
Or in TF Data Validation:
https://github.com/tensorflow/data-validation
If your goal is to use Pipelines, did you consider trying some other example, or creating a new one that doesn't use TFX?
jlewi commented on Jan 10, 2019
It looks like this is a known issue with Apache Beam that has been open for a long time:
https://issues.apache.org/jira/browse/BEAM-2572
vicaire commented on Mar 26, 2019
Resolving, since this is an issue in Beam.
aronchick commented on Mar 26, 2019
I understand our desire to close these issues, but I'd like to suggest we take ownership of the problem. Obviously, most Kubeflow deployments will run against S3 rather than GCP, so most deployments will now hit this problem.
aronchick commented on Mar 26, 2019
At LEAST we should file it as a bug over there. Is the TFX team aware of the issue?
@mameshini - would you mind filing a bug?
mameshini commented on Mar 26, 2019
@aronchick We are currently implementing a storage management approach that mounts an S3 bucket as a Kubernetes volume, using an s3fs dynamic provisioner. It's not just Apache Beam; Keras and other libraries also can't handle S3 or Minio. We can mount S3/GCS/Minio buckets as volumes and access them as a POSIX file system, with an optional caching layer; see the sketch below. I can share working examples soon; fingers crossed, performance seems acceptable. I am holding off on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with.
aronchick commented on Mar 26, 2019
+1!
@vicaire I would suggest we reopen - we need a solution here.
vicaire commented on Mar 27, 2019
Got it.
Apologies, it looks like I misunderstood the issue. Reopening. Getting things working with S3 is a high priority for us.