Usage
sapientml generates source code that trains a machine learning model and makes predictions with it, based on a CSV-formatted dataset and the requirements of the machine learning task to be solved.
SapientML class
sapientml provides the SapientML class, which is the top-level API of SapientML. In the constructor of SapientML, you first need to set target_columns as a requirement of the task. target_columns specifies which columns the task is to predict. Second, you can set task_type to classification or regression as the type of machine learning task. You can also skip setting task_type, in which case SapientML automatically suggests the task type by looking at the values of the target columns.
from sapientml import SapientML

cls = SapientML(
    target_columns=["survived"],
    task_type=None,  # suggested automatically from the target columns
)
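The automatic suggestion can be pictured with a simple heuristic. The sketch below is not SapientML's actual logic: `suggest_task_type` and the `max_classes` threshold are hypothetical names chosen for illustration, and it assumes that non-numeric targets, or numeric targets with only a few distinct values, indicate classification.

```python
from typing import Iterable


def suggest_task_type(values: Iterable[object], max_classes: int = 10) -> str:
    """Guess a task type from target values (illustrative heuristic only).

    Hypothetical helper, not part of the sapientml API.
    """
    observed = [v for v in values if v is not None]
    # Any non-numeric value (for example a string label) suggests classification.
    # bool is excluded from the numeric check because it is a subclass of int.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    if not all_numeric:
        return "classification"
    # Numeric targets with only a few distinct values look like class labels.
    if len(set(observed)) <= max_classes:
        return "classification"
    return "regression"


print(suggest_task_type([0, 1, 1, 0, 1]))                  # classification
print(suggest_task_type([x * 0.37 for x in range(100)]))   # regression
```

A heuristic like this can misjudge edge cases, such as a regression target that happens to take few distinct values, which is one reason to set task_type explicitly when the task is known.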
Like the model classes of other well-known libraries such as scikit-learn, SapientML provides fit and predict to train a model and make predictions using the generated code.
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(train_data)
y_true = test_data["survived"].reset_index(drop=True)
test_data = test_data.drop(columns=["survived"])  # drop the target without mutating the split result in place
cls.fit(train_data, output_dir="./outputs")
y_pred = cls.predict(test_data)
print(f"F1 score: {f1_score(y_true, y_pred)}")
Generated source code
After calling fit, you can find the generated source code in the ./outputs folder. Here is an example of the files generated by fit:
outputs
├── 1_script.py
├── 2_script.py
├── 3_script.py
├── final_predict.py
├── final_script.out.json
├── final_script.py
├── final_train.py
└── lib
    └── sample_dataset.py
1_script.py, 2_script.py, and 3_script.py are scripts that run hold-out validation using the preprocessors and the top-3 most plausible models.
final_script.py is the script that selects the model that actually achieved the highest score among the top-3 models, and final_script.out.json contains its score.
final_train.py is the script for training the selected model, and final_predict.py is the script for prediction using the model trained by final_train.py.
The lib folder contains modules that the above scripts use.
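To inspect what fit wrote, the output folder can be walked with the standard library. This is a minimal sketch: `list_generated_files` and `load_score_file` are hypothetical helper names, and the score file is parsed as generic JSON because its schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, List


def list_generated_files(output_dir: str) -> List[str]:
    """Return the relative paths of all files under output_dir, sorted."""
    root = Path(output_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


def load_score_file(output_dir: str, name: str = "final_script.out.json") -> Any:
    """Parse the score file as JSON without assuming its schema."""
    with open(Path(output_dir) / name, encoding="utf-8") as f:
        return json.load(f)


out = Path("./outputs")  # the output_dir passed to fit
if out.is_dir():
    for path in list_generated_files(str(out)):
        print(path)
    if (out / "final_script.out.json").is_file():
        print(load_score_file(str(out)))
```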
Using generated code as a model
After calling fit
, you can also get cls.model
, which is a GeneratedModel
instance that contains generated source code and .pkl
files of preprocessers and a actual machine learning model. The instance also asts a usual model providing fit
and predict
.
cls.fit(train_data)
model = cls.model # obtains GeneratedModel instance
You can get the set of source code and .pkl files by referring to model.files or by looking into the ./model folder after calling model.save("./model"). Here is an example of the files contained in GeneratedModel:
model
├── final_predict.py
├── final_train.py
├── lib
│   └── sample_dataset.py
├── model.pkl
├── ordinalEncoder.pkl
├── simpleimputer-numeric.pkl
└── simpleimputer-string.pkl
The actual behavior of model.fit is to run final_train.py in a subprocess.
Beware that model.fit(another_train_data) does not retrain the existing model but builds a new one.
model.predict likewise creates a subprocess that executes final_predict.py.
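The subprocess pattern described above can be sketched with the standard library. This is not GeneratedModel's actual implementation: `run_generated_script` is a hypothetical helper, and how the real code passes data to the scripts is not covered here.

```python
import subprocess
import sys


def run_generated_script(script_name: str, work_dir: str) -> subprocess.CompletedProcess:
    """Run a script in a separate Python process (hypothetical helper).

    The script is started with work_dir as its working directory, so relative
    paths inside the script resolve against that folder.
    """
    return subprocess.run(
        [sys.executable, script_name],  # same interpreter as the caller
        cwd=work_dir,
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout and stderr for inspection
        text=True,
    )
```

Each call starts a fresh process that runs the script from the beginning, which is consistent with fit building a new model rather than updating an existing one.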
SapientML provides a utility function to restore a SapientML instance from a generated model.
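import pickle

cls.fit(train_data)

# Serialize the GeneratedModel instance to disk
with open("model.pkl", "wb") as f:
    pickle.dump(cls.model, f)

# Load it back and restore a SapientML instance from it
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
sml = SapientML.from_pretrained(model)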