The `taxi-cab-classification-pipeline.py` sample runs a pipeline with TensorFlow's transform and model-analysis components. It is based on the TensorFlow Model Analysis example, and it trains and analyzes a model on the Taxi Trips dataset released by the City of Chicago.
Note: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.
Read more about the dataset in Google BigQuery. Explore the full dataset in the BigQuery UI.
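For a quick look at the raw data from the command line, you can query the public BigQuery table with the `bq` CLI. This is only a convenience sketch; the dataset ID and column names below are assumed from the public `bigquery-public-data.chicago_taxi_trips` dataset and may differ from the copy used by the sample.

```bash
# Sample a few rows from the public Chicago Taxi Trips table (assumes the bq CLI is authenticated)
bq query --use_legacy_sql=false \
  'SELECT trip_start_timestamp, trip_miles, fare, tips
   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
   LIMIT 10'
```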
Preprocessing and model analysis use Apache Beam. When run in `cloud` mode (instead of `local` mode), those steps use Google Cloud Dataflow to run the Beam pipelines. Therefore, if you want to use `cloud` as the mode for either preprocessing or analysis, you must enable the Dataflow API for the given GCP project. See the guide to enabling the Dataflow API.
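If the API is not yet enabled, one way to enable it is with the `gcloud` CLI (assuming it is installed and authenticated; the project ID below is a placeholder):

```bash
# Enable the Dataflow API for a given GCP project
gcloud services enable dataflow.googleapis.com --project=<YOUR_GCP_PROJECT>
```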
For an On-Premise cluster, you need to create a Persistent Volume (PV) if dynamic volume provisioning is not enabled. The PV needs a capacity of at least 1Gi.
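As a minimal sketch, a static 1Gi PV could look like the following; the `hostPath` volume, the PV name, and the node path are placeholder assumptions, so adapt them to your cluster's storage:

```bash
# Hypothetical 1Gi hostPath PV; adjust the name, access mode, and backing storage for your cluster
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: taxi-cab-classification-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/taxi-cab-classification
EOF
```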
Follow the guide to building a pipeline to install the Kubeflow Pipelines SDK.
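In most setups the SDK is installed with pip; the exact version to pin depends on your Kubeflow Pipelines deployment, so treat this as a sketch:

```bash
# Install the Kubeflow Pipelines SDK, which provides the dsl-compile command used below
pip install --upgrade kfp
```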
For an On-Premise cluster, update `platform` to `'onprem'` in `taxi-cab-classification-pipeline.py`:

```bash
sed -i.sedbak "s/platform = 'GCP'/platform = 'onprem'/" taxi-cab-classification-pipeline.py
```
Then run the following command to compile the sample Python file into a workflow specification. The specification takes the form of a YAML file compressed into a `.tar.gz` file:

```bash
dsl-compile --py taxi-cab-classification-pipeline.py --output taxi-cab-classification-pipeline.tar.gz
```
Open the Kubeflow Pipelines UI. Create a new pipeline, and then upload the compiled specification (the `.tar.gz` file) as a new pipeline template.
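Alternatively, SDK versions that ship the `kfp` CLI can upload the package from the command line; the flags vary between SDK versions, so check `kfp pipeline upload --help` before relying on this sketch:

```bash
# Hypothetical CLI upload of the compiled pipeline package
kfp pipeline upload -p taxi-cab-classification taxi-cab-classification-pipeline.tar.gz
```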
- GCP: The pipeline requires two arguments:
  - The name of a GCP project.
  - An output directory in a Google Cloud Storage bucket, of the form `gs://<BUCKET>/<PATH>` (see the bucket-creation sketch after this list if the bucket does not exist yet).
- On-Premise: For an On-Premise cluster, the pipeline will create a Persistent Volume Claim (PVC) and automatically download the source data to the PVC.
  - The `output` is the PVC mount point for the containers and can be set to `/mnt`.
  - The `project` can be set to `taxi-cab-classification-pipeline-onprem`.
  - If the PVC is mounted at `/mnt`, the parameters below need to be set as follows:
    - `column-names`: `/mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/column-names.json`
    - `train`: `/mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/train.csv`
    - `evaluation`: `/mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/eval.csv`
    - `preprocess-module`: `/mnt/pipelines/samples/core/tfx_cab_classification/taxi-cab-classification/preprocessing.py`
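For the GCP case above, if the output bucket does not exist yet, it can be created ahead of time with `gsutil`; the bucket name and location below are placeholders:

```bash
# Create the Cloud Storage bucket used for pipeline output (placeholder name and location)
gsutil mb -l us-central1 gs://<BUCKET>
```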
Components source:

- Preprocessing: source code, container
- Training: source code, container
- Analysis: source code, container
- Prediction: source code, container