
S3 errors in Pipeline examples for reading training data and artifact storage #596

Closed
@mameshini

Description


I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3-compatible) for storing training data and eval data, and for passing data from step to step in Argo workflows:
"pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"

The issue with s3:// protocol support seems to be specific to the TFDV/Apache Beam step; Beam does not appear to support S3 in its Python SDK. For now we are looking for a way to change the TFDV step to use local/attached storage.

Minio access parameters seem to be properly configured - the validation step successfully creates several folders in the Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation

The error occurs on reading or writing any files from the Minio buckets, and it comes from TensorFlow/Beam tfdv.generate_statistics_from_csv():

File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio files are accessed via the s3:// protocol, for example:
OUTPUT_DIR = 's3://demo04kubeflow/output'

This same step worked fine when train.csv was stored in a GCS bucket:
gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv
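
One way to pursue the local/attached-storage workaround mentioned above, as a minimal sketch (not from the attached notebook): copy the CSV out of Minio to local disk with boto3 and point TFDV at the local path, so Beam never has to resolve an s3:// filesystem. The S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY variables are the same ones used in the ContainerOp configuration below; the bucket and key names here are illustrative.

import boto3
import tensorflow_data_validation as tfdv

# boto3 can talk to Minio by overriding the endpoint URL.
s3 = boto3.client(
    's3',
    endpoint_url='https://{}'.format(S3_ENDPOINT),
    aws_access_key_id=S3_ACCESS_KEY,
    aws_secret_access_key=S3_SECRET_KEY,
)

# Stage the training data on local disk before calling TFDV/Beam.
local_path = '/tmp/train.csv'
s3.download_file('demo04kubeflow', 'tfx/taxi-cab-classification/train.csv', local_path)

# TFDV now reads a plain local file, avoiding Beam's s3:// filesystem lookup.
stats = tfdv.generate_statistics_from_csv(data_location=local_path)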

Minio credentials were provided as environment variables to the ContainerOp (in the snippet below, dsl is from the kfp SDK and k8sc is the kubernetes Python client, i.e. from kubernetes import client as k8sc):

return dsl.ContainerOp(
        name = step_name,
        image = DATAFLOW_TFDV_IMAGE,
        arguments = [
            '--csv-data-for-inference', inference_data,
            '--csv-data-to-validate', validation_data,
            '--column-names', column_names,
            '--key-columns', key_columns,
            '--project', project,
            '--mode', mode,
            '--output', validation_output,
        ],
        file_outputs = {
            'schema': '/schema.txt',
        }
    ).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_ENDPOINT', 
            value=S3_ENDPOINT, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ENDPOINT_URL', 
            value='https://{}'.format(S3_ENDPOINT), 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ACCESS_KEY_ID', 
            value=S3_ACCESS_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_SECRET_ACCESS_KEY', 
            value=S3_SECRET_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_REGION', 
            value='us-east-1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='BUCKET_NAME', 
            value='demo04kubeflow', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_USE_HTTPS', 
            value='1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_VERIFY_SSL', 
            value='1'
    ))
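
Not part of the original notebook, but the repeated add_env_variable() chain above could be factored into a small helper; a sketch reusing the same variables and the same V1EnvVar API:

def add_minio_env(op, s3_endpoint, access_key, secret_key):
    # Apply the Minio/S3 environment variables needed by the TFDV step.
    env = {
        'S3_ENDPOINT': s3_endpoint,
        'AWS_ENDPOINT_URL': 'https://{}'.format(s3_endpoint),
        'AWS_ACCESS_KEY_ID': access_key,
        'AWS_SECRET_ACCESS_KEY': secret_key,
        'AWS_REGION': 'us-east-1',
        'BUCKET_NAME': 'demo04kubeflow',
        'S3_USE_HTTPS': '1',
        'S3_VERIFY_SSL': '1',
    }
    for name, value in env.items():
        op = op.add_env_variable(k8sc.V1EnvVar(name=name, value=value))
    return op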

This pipeline example was created from a Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see the attached Jupyter notebook and the two log files from the pipeline execution (validation step). All required files (such as train.csv) were uploaded to Minio from the notebook.
tfx-taxi-cab-classification-pipeline-example-wait.log

tfx-taxi-cab-classification-pipeline-example-main.log

PipelineTFX4.ipynb.zip

Activity

changed the title from "Use Minio in Pipeline examples for reading training data and artifact storage" to "S3 errors in Pipeline examples for reading training data and artifact storage" on Jan 1, 2019

aronchick commented on Jan 9, 2019

/cc @jlewi can you page in the right folks here? we're blocked on using this until it's solved.


jlewi commented on Jan 10, 2019

Ack, I'll loop in some folks, but it sounds like the issue is actually outside Kubeflow and is in Apache Beam.

You might want to repost the issue in the Apache Beam tracker:
https://issues.apache.org/jira/browse/BEAM-2500

Or in the TF Data Validation repository:
https://github.com/tensorflow/data-validation

If your goal is to use Pipelines, did you consider trying some other example, or creating a new one that doesn't use TFX?


jlewi commented on Jan 10, 2019

It looks like this is a known issue with Apache Beam and has been open for a long time.
https://issues.apache.org/jira/browse/BEAM-2572


vicaire commented on Mar 26, 2019

Resolving, since this is an issue in Beam.


aronchick commented on Mar 26, 2019

I understand our desire to close these issues, but I'd like to suggest we take ownership of the problem. Obviously, most Kubeflow deployments will run against S3 rather than GCP, so most deployments will now hit this problem.


aronchick commented on Mar 26, 2019

At LEAST we should file it as a bug over there. Is the TFX team aware of the issue?

@mameshini - would you mind filing a bug?


mameshini commented on Mar 26, 2019

@aronchick We are currently implementing a storage management approach that mounts an S3 bucket as a Kubernetes volume, using an s3fs dynamic provisioner. It's not just Apache Beam; Keras and other libraries also can't handle S3 or Minio. We can mount S3/GCS/Minio buckets as volumes and access them as a POSIX file system, with an optional caching layer. I can share working examples soon; fingers crossed, performance seems acceptable. I am holding off on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with.
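
For illustration, a rough sketch (not from the examples mentioned above) of what the pipeline side of that could look like, assuming a PersistentVolumeClaim named s3-data-claim has been bound by such an s3fs provisioner (the claim name, mount path, and argument values are hypothetical); the step then reads ordinary file paths instead of s3:// URLs:

from kfp import dsl
from kubernetes import client as k8sc

# Volume backed by a (hypothetical) s3fs-provisioned PVC.
s3_volume = k8sc.V1Volume(
    name='s3-data',
    persistent_volume_claim=k8sc.V1PersistentVolumeClaimVolumeSource(
        claim_name='s3-data-claim'),
)

validation = dsl.ContainerOp(
    name='validation',
    image=DATAFLOW_TFDV_IMAGE,
    arguments=[
        # The bucket contents appear under /mnt/s3 as a POSIX file system.
        '--csv-data-for-inference', '/mnt/s3/tfx/taxi-cab-classification/train.csv',
        '--output', '/mnt/s3/output/validation',
    ],
)
validation.add_volume(s3_volume)
validation.add_volume_mount(k8sc.V1VolumeMount(name='s3-data', mount_path='/mnt/s3'))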


aronchick commented on Mar 26, 2019

+1!

@vicaire I would suggest we reopen - we need a solution here.


vicaire commented on Mar 27, 2019

Got it.

Apologies. Looks like I misunderstood the issue. Reopening. Having things working with S3 is a high priority for us.

[46 remaining timeline items not shown]

