
Sparse Inputs

SciKeras supports sparse inputs (X/features). You don’t have to do anything special for this to work: just pass a sparse matrix to fit().

In this notebook, we’ll demonstrate how this works and compare memory consumption of sparse inputs to dense inputs.

Setup

[1]:
!pip install memory_profiler
%load_ext memory_profiler
Collecting memory_profiler
  Downloading memory_profiler-0.61.0-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: psutil in /home/runner/work/scikeras/scikeras/.venv/lib/python3.12/site-packages (from memory_profiler) (5.9.8)
Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.61.0
[2]:
import warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import get_logger
get_logger().setLevel('ERROR')
warnings.filterwarnings("ignore", message="Setting the random state for TF")
[3]:
try:
    import scikeras
except ImportError:
    !python -m pip install scikeras
[4]:
import scipy
import numpy as np
from scikeras.wrappers import KerasRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
import keras

Data

The dataset we’ll be using is designed to demonstrate the worst case for dense input features and the best case for sparse ones. It consists of a single categorical feature with as many categories as there are rows. This means the one-hot encoded representation needs as many columns as rows, which makes it very inefficient to store as a dense matrix but very efficient to store as a sparse matrix.

[5]:
N_SAMPLES = 20_000  # hand tuned to be ~4GB peak

X = np.arange(0, N_SAMPLES).reshape(-1, 1)
y = np.random.uniform(0, 1, size=(X.shape[0],))
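
To make the storage argument concrete, here is a minimal sketch that compares the two representations at a much smaller size. The size `n = 1_000` is a hypothetical value chosen to keep the example cheap, and the identity matrix stands in for the encoder output under the assumption that every row holds a distinct category.

```python
import numpy as np
import scipy.sparse as sp

n = 1_000  # hypothetical small size, chosen only to keep the example cheap

# One-hot encoding n distinct categories over n rows gives an n x n matrix
# with exactly one 1 per row. With sorted categories, that is the identity.
sparse_onehot = sp.identity(n, dtype=np.float64, format="csr")
dense_onehot = sparse_onehot.toarray()

dense_bytes = dense_onehot.nbytes
# CSR stores three arrays: non-zero values, their column indices, and row pointers.
sparse_bytes = (
    sparse_onehot.data.nbytes
    + sparse_onehot.indices.nbytes
    + sparse_onehot.indptr.nbytes
)

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```

Dense storage grows with the square of the number of rows, while the sparse arrays grow linearly, so the gap widens quickly as the dataset gets larger.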

Model

The model here is nothing special, just a basic multilayer perceptron with one hidden layer.

[6]:
def get_clf(meta) -> keras.Model:
    n_features_in_ = meta["n_features_in_"]
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape=(n_features_in_,)))
    # a single hidden layer
    model.add(keras.layers.Dense(100, activation="relu"))
    model.add(keras.layers.Dense(1))
    return model

Pipelines

Here is where it gets interesting. We make two Scikit-Learn pipelines that use OneHotEncoder: one that uses sparse_output=False to force a dense matrix as the output and another that uses sparse_output=True (the default).

[7]:
dense_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=False)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)

sparse_pipeline = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=True)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)

Benchmark

Our benchmark will be to just train each one of these pipelines and measure peak memory consumption.

[8]:
%memit dense_pipeline.fit(X, y)
peak memory: 5175.21 MiB, increment: 4650.21 MiB
[9]:
%memit sparse_pipeline.fit(X, y)
peak memory: 1001.99 MiB, increment: 40.09 MiB

The memory increment of the dense pipeline should be at least 100x that of the sparse pipeline (here, roughly 4650 MiB versus 40 MiB).
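
A back-of-the-envelope calculation shows where most of the dense increment comes from. This is a rough sketch, assuming the encoder emits its default float64 output; the two increments are copied from the %memit results above.

```python
N_SAMPLES = 20_000
BYTES_PER_FLOAT64 = 8
MIB = 2**20

# Size of the dense N_SAMPLES x N_SAMPLES one-hot matrix on its own.
dense_matrix_mib = N_SAMPLES * N_SAMPLES * BYTES_PER_FLOAT64 / MIB

# Increments reported by %memit in the benchmark cells.
dense_increment_mib = 4650.21
sparse_increment_mib = 40.09
ratio = dense_increment_mib / sparse_increment_mib

print(f"dense one-hot matrix alone: {dense_matrix_mib:.0f} MiB")
print(f"measured increment ratio: {ratio:.0f}x")
```

The dense matrix alone accounts for about 3052 MiB. The rest of the measured increment presumably comes from copies and conversions made during training, which this estimate does not model.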

Runtime

Using sparse inputs can have a drastic impact on memory usage, but it often (though not always) hurts overall runtime. In the run shown here, the sparse pipeline was actually the faster of the two.

[10]:
%timeit dense_pipeline.fit(X, y)
32.8 s ± 9.49 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
[11]:
%timeit sparse_pipeline.fit(X, y)
12.1 s ± 717 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

TensorFlow Datasets

TensorFlow provides a whole suite of functionality around the Dataset. Datasets are lazily evaluated, can be sparse, and minimize the transformations required to feed data into the model. At scale, they are considerably more performant and efficient than NumPy data structures, even sparse ones.

SciKeras does not (and cannot) support Datasets directly, because Scikit-Learn itself does not support them and SciKeras’ public API is Scikit-Learn’s API. You may want to explore breaking out of SciKeras and using TensorFlow/Keras directly to see whether Datasets make a large difference for your use case.
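
For a sense of what that could look like, here is a minimal sketch (outside SciKeras, assuming TensorFlow is installed) of a Dataset that one-hot encodes each batch lazily, so the full dense matrix never exists in memory. The names `n_samples`, `categories`, `targets`, and `encode` are hypothetical and chosen for illustration only.

```python
import numpy as np
import tensorflow as tf

n_samples = 256  # hypothetical small size for illustration
categories = np.arange(n_samples, dtype=np.int64)
targets = np.random.uniform(0, 1, size=(n_samples,)).astype("float32")


def encode(cat_batch, target_batch):
    # One-hot encode a single batch; only batch_size x n_samples values are materialized.
    return tf.one_hot(cat_batch, depth=n_samples), target_batch


dataset = (
    tf.data.Dataset.from_tensor_slices((categories, targets))
    .batch(32)
    .map(encode)
)

# Pull one batch to inspect its shape.
features, labels = next(iter(dataset))
print(features.shape, labels.shape)
```

A Dataset like this can be passed to a Keras model's fit() method directly, though whether it beats the sparse pipeline depends on your data and hardware.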

Bonus: dtypes

You might be able to save even more memory by changing the output dtype of OneHotEncoder.

[12]:
sparse_pipeline_uint8 = Pipeline(
    [
        ("encoder", OneHotEncoder(sparse_output=True, dtype=np.uint8)),
        ("model", KerasRegressor(get_clf, loss="mse", epochs=5, verbose=False))
    ]
)
[13]:
%memit sparse_pipeline_uint8.fit(X, y)
peak memory: 1084.54 MiB, increment: 16.99 MiB
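
The sketch below illustrates which part of a sparse matrix the dtype affects. It uses a small identity matrix as a hypothetical stand-in for the encoder output and assumes CSR storage.

```python
import numpy as np
import scipy.sparse as sp

n = 1_000  # hypothetical small size for illustration

as_float64 = sp.identity(n, dtype=np.float64, format="csr")
as_uint8 = sp.identity(n, dtype=np.uint8, format="csr")

# Only the array of stored values shrinks with the smaller dtype.
print(as_float64.data.nbytes, as_uint8.data.nbytes)
# The index arrays are integer positions and do not depend on the value dtype.
print(as_float64.indices.nbytes == as_uint8.indices.nbytes)
```

Because the index arrays stay the same size, the saving from a smaller dtype is bounded; it helps most when the stored values dominate the footprint.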