Usage

sapientml generates source code that trains a machine learning model and makes predictions with it, given a CSV-formatted dataset and the requirements of the machine learning task to be solved.

SapientML class

sapientml provides the SapientML class, which is the top-level API of SapientML. In the constructor of SapientML, you first need to set target_columns as a requirement of the task. target_columns specifies which columns the task is to predict. Second, you can set task_type to either classification or regression as the type of machine learning task. You can also skip setting task_type, in which case SapientML automatically suggests the task type by looking at the values of the target columns.

from sapientml import SapientML

cls = SapientML(
    target_columns=["survived"],
    task_type=None, # suggested automatically from the target columns
)
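To make the idea of suggesting a task type from target values concrete, here is a minimal sketch of one possible heuristic. It is not SapientML's actual logic: the function name suggest_task_type and the max_classes threshold are hypothetical and exist only for illustration.

```python
from collections.abc import Iterable


def suggest_task_type(values: Iterable[object], max_classes: int = 20) -> str:
    """Guess a task type from target values (illustrative heuristic only)."""
    distinct = {v for v in values if v is not None}
    # bool is a subclass of int, so exclude it explicitly from "numeric".
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if not all_numeric:
        return "classification"  # labels such as strings or booleans
    # Fractional values or many distinct numbers suggest a continuous target.
    has_fraction = any(isinstance(v, float) and not v.is_integer() for v in distinct)
    if has_fraction or len(distinct) > max_classes:
        return "regression"
    return "classification"


print(suggest_task_type([0, 1, 1, 0]))          # classification
print(suggest_task_type([7.25, 71.28, 8.05]))   # regression
print(suggest_task_type(["yes", "no", "yes"]))  # classification
```

If such a guess is wrong for your data, for example integer-valued counts that should be regressed, passing task_type explicitly avoids the ambiguity.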

Like the model classes of other well-known libraries such as scikit-learn, SapientML provides fit and predict to conduct model training and prediction using the generated code.

import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(train_data)
y_true = test_data["survived"].reset_index(drop=True)
test_data = test_data.drop(columns=["survived"])  # returns a new frame instead of mutating the split in place

cls.fit(train_data, output_dir="./outputs")
y_pred = cls.predict(test_data)

print(f"F1 score: {f1_score(y_true, y_pred)}")

Generated source code

After calling fit, you can find the generated source code in the ./outputs folder. Here is an example of the files generated by fit:

outputs
├── 1_script.py
├── 2_script.py
├── 3_script.py
├── final_predict.py
├── final_script.out.json
├── final_script.py
├── final_train.py
└── lib
    └── sample_dataset.py

1_script.py, 2_script.py, and 3_script.py are scripts that run hold-out validation using the preprocessors and each of the top-3 most plausible models. final_script.py is the script for the model that actually achieved the highest score among the top-3 models, and final_script.out.json contains its score. final_train.py is the script for training the selected model, and final_predict.py is the script for prediction using the model trained by final_train.py. The lib folder contains modules that the above scripts use.
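The layout described above can be inspected programmatically. The following is a minimal sketch using only the standard library; summarize_outputs is a hypothetical helper, not part of sapientml, and because the schema of final_script.out.json is not described here, the file is loaded as-is without assuming any keys.

```python
import json
import re
from pathlib import Path


def summarize_outputs(output_dir: str) -> dict:
    """Group the files written by fit, following the layout described above."""
    root = Path(output_dir)
    # Candidate scripts are named <rank>_script.py, e.g. 1_script.py.
    # final_script.py also ends in _script.py, so filter by a numeric prefix.
    candidates = sorted(
        (p.name for p in root.glob("*_script.py")
         if re.fullmatch(r"\d+_script\.py", p.name)),
        key=lambda name: int(name.split("_")[0]),
    )
    result_path = root / "final_script.out.json"
    # Load the JSON without assuming its keys; None if the file is absent.
    result = json.loads(result_path.read_text()) if result_path.exists() else None
    return {
        "candidates": candidates,
        "has_train_script": (root / "final_train.py").exists(),
        "has_predict_script": (root / "final_predict.py").exists(),
        "final_result": result,
    }


if __name__ == "__main__":
    print(summarize_outputs("./outputs"))
```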

Using generated code as a model

After calling fit, you can also get cls.model, which is a GeneratedModel instance that contains the generated source code and the .pkl files of the preprocessors and the actual machine learning model. The instance also acts as a usual model, providing fit and predict.

cls.fit(train_data)
model = cls.model # obtains GeneratedModel instance

You can get the set of source code and .pkl files by referring to model.files or by looking into the ./model folder after calling model.save("./model"). Here is an example of the files contained in GeneratedModel:

model
├── final_predict.py
├── final_train.py
├── lib
│   └── sample_dataset.py
├── model.pkl
├── ordinalEncoder.pkl
├── simpleimputer-numeric.pkl
└── simpleimputer-string.pkl

model.fit actually runs final_train.py in a subprocess. Beware that model.fit(another_train_data) does not retrain the existing model but builds a new one. Likewise, model.predict creates a subprocess that executes final_predict.py.
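The subprocess behavior can be pictured with a small standard-library sketch. This is not SapientML's implementation: run_generated_script is a hypothetical helper, and the stand-in final_train.py written in the demo only prints a message. It illustrates why each call starts from scratch: every run is a fresh Python process with no in-memory state carried over.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def run_generated_script(
    script_name: str, work_dir: str, timeout: Optional[float] = None
) -> str:
    """Run a script from work_dir in a fresh Python process; return its stdout."""
    completed = subprocess.run(
        [sys.executable, script_name],
        cwd=work_dir,         # relative paths used by the script resolve here
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in script for the demo; a real final_train.py is generated by fit.
        Path(tmp, "final_train.py").write_text('print("training from scratch")\n')
        print(run_generated_script("final_train.py", tmp), end="")
```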

SapientML provides a utility function to restore a SapientML instance from a generated model.

import pickle

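cls.fit(train_data)
# Serialize the GeneratedModel held by the fitted instance
with open("model.pkl", "wb") as f:
    pickle.dump(cls.model, f)

# Load it back and rebuild a SapientML instance from it
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
sml = SapientML.from_pretrained(model)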