
S3 errors in Pipeline examples for reading training data and artifact storage #596

Closed
@mameshini

Description


I am working with the Taxi Cab pipeline example and need to replace GCS storage with Minio (S3-compatible) for storing training data and eval data, and for passing data from step to step in Argo workflows:
"pipelines/samples/notebooks/KubeFlow Pipeline Using TFX OSS Components.ipynb"

The issue with s3:// protocol support seems to be specific to the TFDV/Apache Beam step; Beam does not appear to support S3 in its Python SDK. For now we are looking for a way to change the TFDV step to use local/attached storage.

Minio access parameters seem to be properly configured - the validation step successfully creates several folders in the Minio bucket, for example: demo04kubeflow/output/tfx-taxi-cab-classification-pipeline-example-ht94b/validation

The error occurs on reading or writing any files from the Minio buckets, and it comes from TensorFlow/Beam tfdv.generate_statistics_from_csv():

File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 92, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv

Minio files are accessed via the s3:// protocol, for example:
OUTPUT_DIR = 's3://demo04kubeflow/output'

This same step worked fine when train.csv was stored in a GCS bucket:
gs://ml-pipeline-playground/tfx/taxi-cab-classification/train.csv
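
One way to pursue the local/attached-storage workaround mentioned above, as a minimal sketch (not from the attached notebook): copy the CSV out of Minio to local disk with boto3 and point TFDV at the local path, so Beam never has to resolve an s3:// filesystem. The S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY variables are the same ones used in the ContainerOp configuration below; the bucket and key names here are illustrative.

import boto3
import tensorflow_data_validation as tfdv

# boto3 can talk to Minio by overriding the endpoint URL.
s3 = boto3.client(
    's3',
    endpoint_url='https://{}'.format(S3_ENDPOINT),
    aws_access_key_id=S3_ACCESS_KEY,
    aws_secret_access_key=S3_SECRET_KEY,
)

# Stage the training data on local disk before calling TFDV/Beam.
local_path = '/tmp/train.csv'
s3.download_file('demo04kubeflow', 'tfx/taxi-cab-classification/train.csv', local_path)

# TFDV now reads a plain local file, avoiding Beam's s3:// filesystem lookup.
stats = tfdv.generate_statistics_from_csv(data_location=local_path)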

Minio credentials were provided as environment variables to the ContainerOp (in the snippet below, dsl is from the kfp SDK and k8sc is the kubernetes Python client, i.e. from kubernetes import client as k8sc):

return dsl.ContainerOp(
        name = step_name,
        image = DATAFLOW_TFDV_IMAGE,
        arguments = [
            '--csv-data-for-inference', inference_data,
            '--csv-data-to-validate', validation_data,
            '--column-names', column_names,
            '--key-columns', key_columns,
            '--project', project,
            '--mode', mode,
            '--output', validation_output,
        ],
        file_outputs = {
            'schema': '/schema.txt',
        }
    ).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_ENDPOINT', 
            value=S3_ENDPOINT, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ENDPOINT_URL', 
            value='https://{}'.format(S3_ENDPOINT), 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_ACCESS_KEY_ID', 
            value=S3_ACCESS_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_SECRET_ACCESS_KEY', 
            value=S3_SECRET_KEY, 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='AWS_REGION', 
            value='us-east-1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='BUCKET_NAME', 
            value='demo04kubeflow', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_USE_HTTPS', 
            value='1', 
    )).add_env_variable(
        k8sc.V1EnvVar(
            name='S3_VERIFY_SSL', 
            value='1'
    ))
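
Not part of the original notebook, but the repeated add_env_variable() chain above could be factored into a small helper; a sketch reusing the same variables and the same V1EnvVar API:

def add_minio_env(op, s3_endpoint, access_key, secret_key):
    # Apply the Minio/S3 environment variables needed by the TFDV step.
    env = {
        'S3_ENDPOINT': s3_endpoint,
        'AWS_ENDPOINT_URL': 'https://{}'.format(s3_endpoint),
        'AWS_ACCESS_KEY_ID': access_key,
        'AWS_SECRET_ACCESS_KEY': secret_key,
        'AWS_REGION': 'us-east-1',
        'BUCKET_NAME': 'demo04kubeflow',
        'S3_USE_HTTPS': '1',
        'S3_VERIFY_SSL': '1',
    }
    for name, value in env.items():
        op = op.add_env_variable(k8sc.V1EnvVar(name=name, value=value))
    return op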

This pipeline example was created from a Jupyter notebook running on the same Kubernetes cluster as Kubeflow Pipelines, Argo, and Minio. Please see the attached Jupyter notebook and the two log files from the pipeline execution (validation step). All required files (such as train.csv) were uploaded to Minio from the notebook.
tfx-taxi-cab-classification-pipeline-example-wait.log

tfx-taxi-cab-classification-pipeline-example-main.log

PipelineTFX4.ipynb.zip

Activity

changed the title from "Use Minio in Pipeline examples for reading training data and artifact storage" to "S3 errors in Pipeline examples for reading training data and artifact storage" on Jan 1, 2019

aronchick commented on Jan 9, 2019

/cc @jlewi can you page in the right folks here? we're blocked on using this until it's solved.


jlewi commented on Jan 10, 2019

Ack, I'll loop in some folks, but it sounds like the issue is actually outside Kubeflow and is in Apache Beam.

You might want to repost the issue in the Apache Beam tracker:
https://issues.apache.org/jira/browse/BEAM-2500

Or in the TF Data Validation repository:
https://github.com/tensorflow/data-validation

If your goal is to use Pipelines, did you consider trying some other example, or creating a new one that doesn't use TFX?


jlewi commented on Jan 10, 2019

It looks like this is a known issue with Apache Beam and has been open for a long time.
https://issues.apache.org/jira/browse/BEAM-2572


vicaire commented on Mar 26, 2019

Resolving, since this is an issue in Beam.


aronchick commented on Mar 26, 2019

I understand our desire to close these issues, but I'd like to suggest we take ownership of the problem. Obviously, most Kubeflow deployments will run against S3 rather than GCP, so most deployments will now hit this problem.


aronchick commented on Mar 26, 2019

At LEAST we should file it as a bug over there. Is the TFX team aware of the issue?

@mameshini - would you mind filing a bug?


mameshini commented on Mar 26, 2019

@aronchick We are currently implementing a storage management approach that mounts an S3 bucket as a Kubernetes volume, using an s3fs dynamic provisioner. It's not just Apache Beam; Keras and other libraries also can't handle S3 or Minio. We can mount S3/GCS/Minio buckets as volumes and access them as a POSIX file system, with an optional caching layer. I can share working examples soon; fingers crossed, performance seems acceptable. I am holding off on filing a bug because we may be able to solve this problem better with Kubernetes storage provisioners. A lot of Python libraries require a file system to work with.
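
For illustration, a rough sketch (not from the examples mentioned above) of what the pipeline side of that could look like, assuming a PersistentVolumeClaim named s3-data-claim has been bound by such an s3fs provisioner (the claim name, mount path, and argument values are hypothetical); the step then reads ordinary file paths instead of s3:// URLs:

from kfp import dsl
from kubernetes import client as k8sc

# Volume backed by a (hypothetical) s3fs-provisioned PVC.
s3_volume = k8sc.V1Volume(
    name='s3-data',
    persistent_volume_claim=k8sc.V1PersistentVolumeClaimVolumeSource(
        claim_name='s3-data-claim'),
)

validation = dsl.ContainerOp(
    name='validation',
    image=DATAFLOW_TFDV_IMAGE,
    arguments=[
        # The bucket contents appear under /mnt/s3 as a POSIX file system.
        '--csv-data-for-inference', '/mnt/s3/tfx/taxi-cab-classification/train.csv',
        '--output', '/mnt/s3/output/validation',
    ],
)
validation.add_volume(s3_volume)
validation.add_volume_mount(k8sc.V1VolumeMount(name='s3-data', mount_path='/mnt/s3'))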


aronchick commented on Mar 26, 2019

+1!

@vicaire I would suggest we reopen - we need a solution here.


vicaire commented on Mar 27, 2019

Got it.

Apologies. Looks like I misunderstood the issue. Reopening. Having things working with S3 is a high priority for us.

[46 remaining timeline items not shown]

