Large tabular data & AutoAI

Lukasz Cmielowski, PhD
4 min read · Dec 1, 2022


written by: Lukasz Cmielowski, PhD, Thomas Parnell

In Cloud Pak for Data 4.6, Watson Studio AutoAI introduces support for large tabular data. Data sets up to 100 GB are consumed using a combination of ensembling and incremental learning. Adopting BatchedTreeEnsembleClassifier and BatchedTreeEnsembleRegressor from Snap Machine Learning adds partial_fit() capabilities (training on batches) to classical algorithms:

Classifiers

  • ExtraTreesClassifier
  • XGBClassifier
  • LightGBMClassifier
  • RandomForestClassifier
  • SnapRandomForestClassifier
  • SnapBoostingMachineClassifier

Regressors

  • ExtraTreesRegressor
  • LightGBMRegressor
  • RandomForestRegressor
  • SnapBoostingMachineRegressor
  • SnapRandomForestRegressor
  • XGBRegressor

Snap ML’s BatchedTreeEnsemble capability enables incremental training for any of the base ensembles listed above. It achieves this by breaking the ensemble of trees into a number of smaller sub-ensembles, each of which is trained on a new batch of data. Internally, boosting is applied across batches so that, as each new sub-ensemble is trained, errors made by the previous sub-ensembles are corrected, incrementally improving overall accuracy with each consumed batch. Note that the total number of trees in the BatchedTreeEnsemble remains the same as in the base ensemble (no added inference complexity), but by incrementally training the ensemble over a much larger data set, a visible improvement in accuracy over the base ensemble can be achieved, all without the need for extra memory.

Figure: BatchedTreeEnsemble architecture flowchart.
Figure: Benchmarking BatchedTreeEnsemble on the Criteo dataset.
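
To make the batching scheme concrete, here is a minimal sketch of the wrapping pattern, assuming Snap ML's BatchedTreeEnsembleClassifier accepts a base ensemble and follows the scikit-learn partial_fit() convention (parameter names may vary between snapml releases):

import numpy as np
from snapml import SnapBoostingMachineClassifier, BatchedTreeEnsembleClassifier

# Wrap a base ensemble so it can be trained incrementally, batch by batch.
# The base_ensemble parameter is an assumption; check the snapml docs for
# your version.
clf = BatchedTreeEnsembleClassifier(base_ensemble=SnapBoostingMachineClassifier())

# Simulate a stream of data batches; in practice these come from a data loader.
for _ in range(5):
    X_batch = np.random.rand(1000, 20)
    y_batch = np.random.randint(0, 2, size=1000)
    # classes follows the scikit-learn partial_fit convention.
    clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

preds = clf.predict(np.random.rand(10, 20))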

New ensemble estimators (BatchedTreeEnsemble) have been added to the list of estimators. For each supported estimator, the AutoAI experiment produces a fifth pipeline (BatchedTreeEnsemble). That extra pipeline has partial_fit capabilities: it can be trained on batches of data. The AutoAI-generated "Incremental learning notebook" contains the code to keep training the model on all batches of data.

The flow

Figure: Large tabular data support — the flow.

AutoAI uses a sample of the data to build the BatchedTreeEnsemble pipelines. The sampling type can be modified by the user. The supported sampling types are: first values (reading the data from the beginning until the cutoff point), stratified, and random. The default sampling technique is random.
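
As an illustration, setting the sampling type when defining the experiment through the ibm-watson-machine-learning SDK might look like the sketch below; the SamplingTypes enum values and optimizer parameters shown are assumptions to be checked against the SDK documentation for your release:

from ibm_watson_machine_learning.experiment import AutoAI
from ibm_watson_machine_learning.utils.autoai.enums import SamplingTypes

experiment = AutoAI(wml_credentials, space_id=space_id)

# sampling_type is assumed to accept the three techniques described above;
# RANDOM is the default.
pipeline_optimizer = experiment.optimizer(
    name="Large tabular data experiment",
    prediction_type=AutoAI.PredictionType.BINARY,
    prediction_column="label",
    sampling_type=SamplingTypes.RANDOM,
)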

Next, AutoAI produces the Incremental learning notebook that contains the code to continue training on all batches of data.

The notebook

The generated notebook uses a torch-compatible data loader called ExperimentIterableDataset. This data loader can work with various data sources, such as Db2, PostgreSQL, Amazon S3, Snowflake, and more.
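
A sketch of the data-loading pattern follows; the import paths and the enable_sampling / experiment_metadata parameters mirror the structure of the generated notebook but may differ between SDK versions:

from ibm_watson_machine_learning.data_loaders.experiment import ExperimentDataLoader
from ibm_watson_machine_learning.data_loaders.datasets.experiment import ExperimentIterableDataset

# Stream the full training data set in batches; training_data_references and
# experiment_metadata are defined earlier in the generated notebook.
dataset = ExperimentIterableDataset(
    connection=training_data_references[0],
    enable_sampling=False,  # read all of the data, not a sample
    experiment_metadata=experiment_metadata,
)

# A torch-style data loader that yields batches as pandas DataFrames.
data_loader = ExperimentDataLoader(dataset)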

In the next step, the code downloads the BatchedTreeEnsemble model from the completed AutoAI experiment using the get_pipeline() method.
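
For example (the run ID and pipeline name Pipeline_5 are illustrative; the actual names come from your experiment):

from ibm_watson_machine_learning.experiment import AutoAI

# Reconnect to the finished experiment run and fetch the BatchedTreeEnsemble
# pipeline object.
experiment = AutoAI(wml_credentials, space_id=space_id)
pipeline_optimizer = experiment.runs.get_optimizer(run_id)
model = pipeline_optimizer.get_pipeline(pipeline_name="Pipeline_5")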

Finally, the model is trained on all batches of data via partial_fit(). Learning curve, scalability, and model performance charts are displayed.
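
The core of that training loop can be sketched as follows, assuming the data loader yields pandas DataFrames and the target column is named "label" (an example name):

# Incrementally train the fetched pipeline on every batch yielded by the
# data loader; the BatchedTreeEnsemble pipeline exposes partial_fit.
for batch_df in data_loader:
    X = batch_df.drop(columns=["label"])  # "label" is an example target column
    y = batch_df["label"]
    model.partial_fit(X, y, classes=[0, 1])  # classes per the sklearn convention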

The notebook can be easily customized to:

  • use a different data loader (it must return batches of data as pandas DataFrames)
  • support a custom scorer function (metric) during batch-based training
  • include learning stop constraints (e.g., stop once the model accuracy reaches a specific threshold; see the sketch after this list)
  • run outside the Watson Studio ecosystem (e.g., on local infrastructure)
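
For instance, the learning stop constraint from the third bullet could be added as in this sketch, where the holdout set and the 0.85 threshold are hypothetical choices:

from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # hypothetical target accuracy; choose per use case
X_holdout, y_holdout = None, None

for batch_df in data_loader:
    X = batch_df.drop(columns=["label"])  # "label" is an example target column
    y = batch_df["label"]

    if X_holdout is None:
        # Reserve the first batch as a holdout set for the stop check.
        X_holdout, y_holdout = X, y
        continue

    model.partial_fit(X, y, classes=[0, 1])

    # Stop early once the model is accurate enough on the holdout set.
    if accuracy_score(y_holdout, model.predict(X_holdout)) >= ACCURACY_THRESHOLD:
        break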

Let’s summarize the key features of AutoAI’s support for large tabular data:

  • Support for large tabular data sets without the need for extra resources
  • Option to stop and continue training at any time, on any infrastructure
  • Full transparency and flexibility of the training procedure
  • Model storage and deployment with just a few lines of code (see the sketch below)
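
As an illustration of the last point, here is a hedged sketch using the ibm-watson-machine-learning APIClient; the model type string and software specification name are assumptions tied to a particular runtime:

from ibm_watson_machine_learning import APIClient

client = APIClient(wml_credentials)
client.set.default_space(space_id)

# Store the trained model, then create an online deployment for it.
model_details = client.repository.store_model(
    model=model,
    meta_props={
        client.repository.ModelMetaNames.NAME: "Incrementally trained AutoAI model",
        client.repository.ModelMetaNames.TYPE: "scikit-learn_1.1",  # example type
        client.repository.ModelMetaNames.SOFTWARE_SPEC_UID:
            client.software_specifications.get_id_by_name("runtime-22.2-py3.10"),
    },
)
model_id = client.repository.get_model_id(model_details)

deployment = client.deployments.create(
    artifact_uid=model_id,
    meta_props={
        client.deployments.ConfigurationMetaNames.NAME: "AutoAI incremental model",
        client.deployments.ConfigurationMetaNames.ONLINE: {},
    },
)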

Written by Lukasz Cmielowski, PhD

Senior Technical Staff Member at IBM, responsible for AutoAI (AutoML).
