Building an ML Pipeline with Kedro

Introducing ML pipelines and Kedro

ML pipeline: a high-level overview

Example project: https://github.com/kennethleungty/Anomaly-Detection-Pipeline-Kedro

1. Data Retrieval / Ingestion:

  • Ingest the data from different sources, such as batch logs or existing public datasets.
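
A minimal ingestion sketch, assuming a local CSV with a timestamp column (the path and column name are illustrative, not from the example repo):

# Ingestion sketch: load raw records from a local file or a downloaded
# public dataset. The path and timestamp column are assumptions.
import pandas as pd

def ingest(raw_path: str = "data/01_raw/transactions.csv") -> pd.DataFrame:
    return pd.read_csv(raw_path, parse_dates=["timestamp"])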

2. Data Preparation:
This is the part that may take up most of your time.

  • Data cleaning:
    • Remove records that contain outliers
    • How do you handle null features? Remove the records, or perform feature imputation?
    • How do you deal with sparse categorical features (e.g., a variable with 1 to 10,000 categories, where each category appears in only one record)? Drop them, or transform them into other meaningful features?
  • Data transformation / Feature engineering:
    • Choose the data types used in memory
    • Convert categorical features into label-encoded features
    • Derive new features from existing ones using domain knowledge
    • Transform skewed distributions toward a normal distribution
    • Perform normalization or standardization
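
A rough sketch of these cleaning and transformation steps in code; the column names ("amount", "category"), the 3-sigma outlier rule, and the non-negative amounts are assumptions for illustration:

# Data preparation sketch: imputation, outlier removal, label encoding,
# skew correction, and standardization.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Impute null numeric features with the median instead of dropping rows
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Remove outlier records (here: beyond 3 standard deviations of "amount")
    z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
    df = df[z.abs() < 3].copy()

    # Label-encode a categorical feature
    df["category"] = LabelEncoder().fit_transform(df["category"].astype(str))

    # Reduce right skew with a log transform (assumes non-negative amounts),
    # then standardize to zero mean and unit variance
    df["amount"] = np.log1p(df["amount"])
    df["amount"] = StandardScaler().fit_transform(df[["amount"]]).ravel()
    return df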

3. Model Training:

  • Train the model provided by the data scientist
  • Decide how to split the processed dataset into a training set and a testing set
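
As a sketch, a stratified holdout split plus training of a model handed over by the data scientist (the target column name and the classifier choice are assumptions):

# Training sketch: hold out a test set, then fit the model.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train(df, target_col="label"):
    X, y = df.drop(columns=[target_col]), df[target_col]
    # Stratify so both splits keep the original class balance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    return model, X_test, y_test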

4. Model Evaluation:

  • Decide on the evaluation metrics for the model
  • For a classification task, consider AUROC, the F1-score, and the confusion matrix; you can also find the best decision threshold by changing it iteratively
  • For a regression task, consider RMSE or MSE for evaluation
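
A sketch of the classification metrics above, including a simple iterative threshold search (the threshold grid and the choice of F1 as the search criterion are assumptions):

# Evaluation sketch: AUROC, confusion matrix, and an F1-based threshold sweep.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def evaluate(model, X_test, y_test):
    proba = model.predict_proba(X_test)[:, 1]
    print("AUROC:", roc_auc_score(y_test, proba))

    # Find the decision threshold that maximizes F1 by changing it iteratively
    thresholds = np.linspace(0.05, 0.95, 19)
    best = max(thresholds, key=lambda t: f1_score(y_test, (proba >= t).astype(int)))
    print("Best threshold:", best)
    print(confusion_matrix(y_test, (proba >= best).astype(int)))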

5. Deployment in Production:

  • How do you manage models whenever the retraining pipeline runs?
  • How do you design a caching mechanism to reduce inference time?
  • Does the bias between offline and online data need to be taken into account?
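
For the caching question, one common pattern is to memoize predictions for repeated inputs; a minimal in-process sketch (the cache size and using the raw feature tuple as the key are assumptions):

# Inference caching sketch: repeated feature vectors skip the model call.
from functools import lru_cache

def make_cached_predictor(model, maxsize=10_000):
    """Wrap a fitted model so identical feature vectors hit the cache."""
    @lru_cache(maxsize=maxsize)
    def predict(features: tuple) -> float:
        # Tuples are hashable, so the feature vector doubles as the cache key
        return float(model.predict_proba([list(features)])[0, 1])
    return predict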

6. Monitoring the System:

  • Service throughput, memory, and CPU usage
  • Data drift and model drift detection
  • When labeled data cannot be obtained from the online production environment, detecting data drift between different time slots is a good way to validate that the same pipeline and model still work in the current environment
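
A sketch of such label-free drift detection between two time slots, using a two-sample Kolmogorov-Smirnov test per numeric feature (the significance level is an assumption):

# Drift detection sketch: compare each numeric feature's distribution
# between a reference time slot and the current time slot.
from scipy.stats import ks_2samp

def detect_drift(reference_df, current_df, alpha=0.01):
    drifted = []
    for col in reference_df.columns:
        # Two-sample KS test: a small p-value means the two samples are
        # unlikely to come from the same distribution
        _, p_value = ks_2samp(reference_df[col], current_df[col])
        if p_value < alpha:
            drifted.append(col)
    return drifted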

Next, finding the tool

Because machine learning is a data-driven system, the work has only just begun once the model is deployed to the environment. A system that helps engineers maintain the lifecycle is important, so next I'm going to introduce a tool called Kedro that integrates the work between data scientists and data engineers.

What’s Kedro and what does it do?

A tool that packages the pipeline behind a CLI, turning a data scientist's experimental code into production code.

  • Uses cookiecutter (a template generator) to generate the standard pipeline structure from Data Ingestion (step 1) to Model Evaluation (step 4):
project-template    # Project folder
├── conf            # Configuration files
├── data            # Local project data
├── docs            # Documentation
├── logs            # Logs of pipeline runs
├── notebooks       # Exploratory Jupyter notebooks
├── pyproject.toml  # Identifies the project root
├── README.md       # README.md explaining your project
├── setup.cfg       # Configuration options for testing and linting
└── src             # Source code for pipelines
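
This scaffold comes from Kedro's project creation command, which prompts you for a project name and generates the structure above:

kedro new
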
  • Use the CLI to execute the pipeline every time model retraining happens:
kedro run
  • Stores the result of each step in the corresponding directory location defined by catalog.yml. This is really helpful for avoiding duplicated work when a failure happens at any step of the pipeline, since we can choose to execute only the slice of the pipeline we want.
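
A minimal catalog.yml entry might look like the sketch below; the dataset name and file path are assumptions, and the exact type string depends on your Kedro version:

# conf/base/catalog.yml (illustrative entry)
model_input_table:
  type: pandas.CSVDataSet
  filepath: data/03_primary/model_input_table.csv

With intermediate outputs persisted like this, a failed run can be resumed from a slice, for example with kedro run --from-nodes=<node_name> (the slicing flags vary slightly between Kedro versions).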

  • Data pipeline visualization: a quick overview of the relationships between nodes helps us understand the dataflow for training and evaluation
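
The interactive graph is served by the separately installed Kedro-Viz plugin (in recent versions the subcommand may be kedro viz run):

pip install kedro-viz
kedro viz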

  • A free, open-source project

  • Uses parameters.yml to manage hyperparameters whenever we modify the arguments of the training pipeline; the relationships between hyperparameters and the nodes they affect also show up in the visualization. See the sketch below.
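
A sketch of such a file (the keys and values are assumptions for illustration); a node can then declare an input like params:model_options to receive these values:

# conf/base/parameters.yml (illustrative hyperparameters)
test_size: 0.2
random_state: 42
model_options:
  n_estimators: 100
  max_depth: 8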