tfx_cab_classification

The taxi-cab-classification-pipeline.py sample runs a pipeline with TensorFlow Transform (TFT) and TensorFlow Model Analysis (TFMA) components.

The dataset

This sample is based on the TensorFlow Model Analysis example.

The sample trains and analyzes a model based on the Taxi Trips dataset released by the City of Chicago.

Note: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.

Read more about the dataset in Google BigQuery. Explore the full dataset in the BigQuery UI.
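If you have the bq command-line tool configured, you can also peek at the data from a terminal. This is just an illustrative query; it assumes the public BigQuery copy of the data at bigquery-public-data.chicago_taxi_trips.taxi_trips:

bq query --use_legacy_sql=false 'SELECT trip_start_timestamp, trip_miles, fare FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips` LIMIT 5'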

Requirements

Preprocessing and model analysis use Apache Beam.

When run in cloud mode (instead of local mode), these steps use Google Cloud Dataflow to run the Beam pipelines.

Therefore, you must enable the Dataflow API for your GCP project if you want to run either preprocessing or analysis in cloud mode. See the guide to enabling the Dataflow API.
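For example, with the gcloud CLI (substitute your own project ID):

gcloud services enable dataflow.googleapis.com --project=<PROJECT_ID>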

For an on-premises cluster, you need to create a Persistent Volume (PV) if dynamic volume provisioning is not enabled. The PV needs a capacity of at least 1Gi.
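As a minimal sketch, the following creates a 1Gi hostPath PV with kubectl; the PV name, access mode, and host path are placeholders and should be adapted to your cluster and to the PVC the pipeline creates:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: taxi-cab-pv          # placeholder name
spec:
  capacity:
    storage: 1Gi             # the sample needs at least 1Gi
  accessModes:
    - ReadWriteOnce          # adjust if the pipeline's PVC requests a different mode
  hostPath:
    path: /data/taxi-cab     # placeholder path on the node
EOF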

Compiling the pipeline template

Follow the guide to building a pipeline to install the Kubeflow Pipelines SDK.
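If the SDK is not installed yet, the usual route is pip (the package name is kfp):

pip install kfp --upgrade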

For an on-premises cluster, update the platform variable to 'onprem' in taxi-cab-classification-pipeline.py:

sed -i.sedbak "s/platform = 'GCP'/platform = 'onprem'/" taxi-cab-classification-pipeline.py

Then run the following command to compile the sample Python file into a workflow specification. The specification takes the form of a YAML file compressed into a .tar.gz file.

dsl-compile --py taxi-cab-classification-pipeline.py --output taxi-cab-classification-pipeline.tar.gz
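To sanity-check the result, list the contents of the archive; it should contain a single compiled pipeline YAML file:

tar -tzf taxi-cab-classification-pipeline.tar.gz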

Deploying the pipeline

Open the Kubeflow Pipelines UI. Create a new pipeline, and then upload the compiled specification (the .tar.gz file) as a new pipeline template.

  • GCP: The pipeline requires two arguments:

    1. The name of a GCP project.
    2. An output directory in a Google Cloud Storage bucket, of the form gs://<BUCKET>/<PATH>.
  • On-Premise: For an on-premises cluster, the pipeline creates a Persistent Volume Claim (PVC) and automatically downloads the source data to it (a quick PVC check is shown after this list).

    1. The output is the PVC mount point for the containers; it can be set to /mnt.
    2. The project can be set to taxi-cab-classification-pipeline-onprem.
    3. If the PVC is mounted at /mnt, set the parameters below as follows:
    • column-names: /mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/column-names.json
    • train: /mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/train.csv
    • evaluation: /mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/eval.csv
    • preprocess-module: /mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/preprocessing.py
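Once an on-premises run has started, a quick way to confirm that the PVC was created and bound (assuming Kubeflow Pipelines is deployed in the kubeflow namespace) is:

kubectl get pvc -n kubeflow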

Components source

Preprocessing: source code container

Training: source code container

Analysis: source code container

Prediction: source code container