Usage

sapientml generates source code for training a machine learning model and making predictions with it, given a CSV-formatted dataset and the requirements of the machine learning task to be solved.

SapientML class

sapientml provides the SapientML class, which is the top-level API of SapientML. In the constructor of SapientML, you first need to set target_columns as a requirement of the task; target_columns specifies which columns the task is to predict. Second, you can set task_type to either classification or regression as the type of machine learning task. You can also skip setting task_type, in which case SapientML automatically suggests the task type by looking at the values of the target columns.

from sapientml import SapientML

cls = SapientML(
    target_columns=["survived"],
    task_type=None, # suggested automatically from the target columns
)
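
To make the automatic suggestion more concrete, here is a minimal sketch of one plausible heuristic for choosing between classification and regression from the target values. It is an assumption for illustration only and is not SapientML's actual logic; the function name suggest_task_type and the max_classes threshold are hypothetical.

```python
from numbers import Real


def suggest_task_type(target_values, max_classes=20):
    """Guess a task type from target values (hypothetical heuristic)."""
    # Drop missing entries: None, and NaN (the only value not equal to itself).
    values = [v for v in target_values if v is not None and v == v]
    if not values:
        raise ValueError("target column has no usable values")
    # Non-numeric targets (strings, booleans) can only be class labels.
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "classification"
    # Numeric targets with fractional parts are treated as continuous.
    if any(not float(v).is_integer() for v in values):
        return "regression"
    # Integer-valued targets: few distinct values suggest class labels.
    return "classification" if len(set(values)) <= max_classes else "regression"


print(suggest_task_type([0, 1, 1, 0]))         # classification
print(suggest_task_type([7.25, 71.28, 8.05]))  # regression
```

When a heuristic like this could guess wrong, for example with an integer-coded rating, passing task_type explicitly to the constructor removes the ambiguity.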

Like the model classes of other well-known libraries such as scikit-learn, SapientML provides fit and predict, which run model training and prediction using the generated code.

import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
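# Hold out part of the data for evaluation (train_test_split defaults to a 25% test split).
train_data, test_data = train_test_split(train_data)
y_true = test_data["survived"].reset_index(drop=True)
# Remove the target column from the test features before prediction.
test_data = test_data.drop(columns=["survived"])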

cls.fit(train_data, output_dir="./outputs")
y_pred = cls.predict(test_data)

print(f"F1 score: {f1_score(y_true, y_pred)}")

Generated source code

After calling fit, you can find the generated source code in the ./outputs folder. Here is an example of the files generated by fit:

outputs
├── 1_script.py
├── 2_script.py
├── 3_script.py
├── final_predict.py
├── final_script.out.json
├── final_script.py
├── final_train.py
└── lib
    └── sample_dataset.py

1_script.py, 2_script.py, and 3_script.py are scripts that run hold-out validation using the preprocessors and the top-3 most plausible models. final_script.py is the script for the model that actually achieved the highest score among the top-3 models, and final_script.out.json contains its score. final_train.py is the script for training the selected model, and final_predict.py is the script for prediction using the model trained by final_train.py. The lib folder contains modules that the above scripts use.
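
As a quick way to inspect these artifacts, the sketch below lists the candidate scripts and loads final_script.out.json using only the standard library. The helper name summarize_outputs is hypothetical, and no JSON schema is assumed: the parsed content is returned as-is.

```python
import json
from pathlib import Path


def summarize_outputs(output_dir="./outputs"):
    """Return candidate script names and the parsed score file, if present."""
    out = Path(output_dir)
    # Candidate scripts follow the <n>_script.py naming shown in the tree above.
    candidates = sorted(p.name for p in out.glob("[0-9]*_script.py"))
    score_file = out / "final_script.out.json"
    # No particular schema is assumed; the JSON is returned unmodified.
    result = json.loads(score_file.read_text()) if score_file.exists() else None
    return candidates, result


if Path("./outputs").is_dir():
    candidates, result = summarize_outputs("./outputs")
    print("candidate scripts:", candidates)
    print("final_script.out.json:", result)
```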

Using generated code as a model

After calling fit, you can also get cls.model, which is a GeneratedModel instance that contains the generated source code and the .pkl files of the preprocessors and the actual machine learning model. The instance also acts as a usual model providing fit and predict.

cls.fit(train_data)
model = cls.model # obtains GeneratedModel instance

You can get the set of source code and .pkl files by referring to model.files or by looking into the ./model folder after calling model.save("./model"). Here is an example of the files contained in GeneratedModel:

model
├── final_predict.py
├── final_train.py
├── lib
│   └── sample_dataset.py
├── model.pkl
├── ordinalEncoder.pkl
├── simpleimputer-numeric.pkl
└── simpleimputer-string.pkl

model.fit actually runs final_train.py in a subprocess. Beware that model.fit(another_train_data) does not retrain the existing model but builds a new one. Likewise, model.predict runs final_predict.py in a subprocess.
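
The subprocess behavior described above can be sketched with the standard library. This illustrates the general technique under stated assumptions and is not SapientML's implementation: the helper run_script is hypothetical, and a throwaway script stands in for final_train.py. Because each call starts a fresh interpreter, nothing from a previous run is kept in memory, which is consistent with fit building a new model rather than updating an existing one.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(script_name, workdir):
    """Run a Python script in a child process and return its stdout."""
    completed = subprocess.run(
        [sys.executable, script_name],
        cwd=workdir,          # relative paths in the script resolve against workdir
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for a generated script such as final_train.py (hypothetical content).
    script = Path(workdir) / "train_stub.py"
    script.write_text(
        "with open('model.txt', 'w') as f:\n"
        "    f.write('trained')\n"
        "print('done')\n"
    )
    output = run_script("train_stub.py", workdir)
    artifact = (Path(workdir) / "model.txt").read_text()

print(output.strip(), artifact)
```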

SapientML provides a utility function, from_pretrained, to restore a SapientML instance from a generated model.

import pickle

cls.fit(train_data)
with open("model.pkl", "wb") as f:
    pickle.dump(cls.model, f)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)
sml = SapientML.from_pretrained(model)