API

Main class

class sapientml.SapientML(target_columns: list[str], task_type: Literal['classification', 'regression'] | None = None, adaptation_metric: str | None = None, split_method: Literal['random', 'time', 'group'] = 'random', split_seed: int = 17, split_train_size: float = 0.75, split_column_name: str | None = None, time_split_num: int = 5, time_split_index: int = 4, split_stratification: bool | None = None, model_type: str = 'sapientml', **kwargs)

The constructor of SapientML.

You can pass all the keyword arguments for configurations of PipelineGenerators and CodeBlockGenerators from plugins.

Parameters:
  • target_columns (list[str]) – Names of target columns.

  • task_type ('classification', 'regression' or None) – Task type: ‘classification’ or ‘regression’.

  • adaptation_metric (str) – Metric for evaluation. Classification: ‘f1’, ‘auc’, ‘ROC_AUC’, ‘accuracy’, ‘Gini’, ‘LogLoss’, ‘MCC’ (Matthews correlation coefficient), ‘QWK’ (Quadratic weighted kappa). Regression: ‘r2’, ‘RMSLE’, ‘RMSE’, ‘MAE’.

  • split_method ('random', 'time', or 'group') – Method of train-test split. ‘random’ uses random split. ‘time’ requires ‘split_column_name’. This sorts the data rows based on the column, and then splits data. ‘group’ requires ‘split_column_name’. This splits the data so that rows with the same value of ‘split_column_name’ are not placed in both training and test data.

  • split_seed (int) – Random seed for train-test split. Ignored when split_method=‘time’.

  • split_train_size (float) – Fraction of the input data used for training. Ignored when split_method=‘time’.

  • split_column_name (str) – Name of the column used to split. Ignored when split_method=‘random’.

  • time_split_num (int) – Passed to TimeSeriesSplit’s n_splits. Valid only when split_method=‘time’.

  • time_split_index (int) – The index of the split from TimeSeriesSplit. Valid only when split_method=‘time’.

  • split_stratification (bool) – Whether to stratify the train-test split. Valid only when task_type=‘classification’.
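
To make the interaction between time_split_num and time_split_index concrete, here is a minimal pure-Python sketch of the expanding-window scheme that scikit-learn's TimeSeriesSplit uses with its default settings (no gap, no max_train_size, default test_size). The helper names time_series_split and pick_split are hypothetical and are not part of SapientML; the sketch assumes the rows are already sorted by split_column_name and that SapientML selects the split at time_split_index.

```python
from typing import Iterator, List, Tuple


def time_series_split(n_samples: int, n_splits: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each expanding-window split."""
    if n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 rows")
    # Every test fold has the same size; the training window grows each split.
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


def pick_split(n_samples: int, time_split_num: int = 5, time_split_index: int = 4):
    """Return the single split selected by time_split_index (hypothetical helper)."""
    splits = list(time_series_split(n_samples, time_split_num))
    return splits[time_split_index]


# 12 rows sorted by time, using the documented defaults (5 splits, index 4).
train_idx, test_idx = pick_split(12)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(test_idx)   # [10, 11]
```

With the documented defaults the last of the five splits is selected, so the most recent rows end up in the test set and all earlier rows are used for training.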

fit(training_data: DataFrame | str, validation_data: DataFrame | str | None = None, test_data: DataFrame | str | None = None, save_datasets_format: Literal['csv', 'pickle'] = 'pickle', csv_encoding: Literal['UTF-8', 'SJIS'] = 'UTF-8', csv_delimiter: str = ',', ignore_columns: list[str] | None = None, output_dir: str = './outputs', codegen_only: bool = False)

Generate ML scripts for input data.

Parameters:
  • training_data (pandas.DataFrame or str) – Training dataframe. When str, this is regarded as a file path.

  • validation_data (pandas.DataFrame, str or None) – Validation dataframe. When str, this is regarded as a file path. When None, validation data is extracted from training data by split.

  • test_data (pandas.DataFrame, str, or None) – Test dataframe. When str, this is regarded as a file path. When None, test data is extracted from training data by split.

  • save_datasets_format ('csv' or 'pickle') – Data format when the input dataframes are written to files. Ignored when all inputs are specified as file paths.

  • csv_encoding ('UTF-8' or 'SJIS') – Encoding method when csv files are involved. Ignored when only pickle files are involved.

  • csv_delimiter (str) – Delimiter to read csv files.

  • ignore_columns (list[str]) – Column names which must not be used and must be dropped.

  • output_dir (str) – Output directory.

  • codegen_only (bool) – If True, only generate code and skip fit() of GeneratedModel.

Returns:

self – SapientML object itself.

Return type:

SapientML

static from_pretrained(model)

Factory method that creates a SapientML instance from a pretrained model built by source code previously generated by SapientML.

model must be a pickle filename, a pickle bytes-like object, or a deserialized object.

Parameters:

model (str, bytes-like object, or GeneratedModel) – A pretrained model built by source code previously generated by SapientML.

Returns:

sml – a new SapientML instance loaded from the pretrained model.

Return type:

SapientML

predict(test_data: DataFrame)

Predict the output for test_data.

Parameters:

test_data (pd.DataFrame) – Dataframe used for predicting the result.

Returns:

result – The contents of prediction_result.csv as a dataframe.

Return type:

pd.DataFrame
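
As a rough illustration of how the constructor, fit() and predict() fit together, the following sketch uses only argument names from the signatures above. The column name, the CSV paths and the helper names build_sapientml_kwargs and run_example are hypothetical, and passing n_models and hyperparameter_tuning through **kwargs assumes the SapientMLConfig plugin described later in this page is installed.

```python
def build_sapientml_kwargs() -> dict:
    """Constructor arguments; every key appears in the signatures documented here."""
    return {
        "target_columns": ["survived"],  # hypothetical target column
        "task_type": "classification",
        "adaptation_metric": "f1",
        "split_method": "random",
        "split_train_size": 0.75,
        "split_stratification": True,
        # Plugin configuration forwarded through **kwargs (assumed SapientMLConfig fields).
        "n_models": 3,
        "hyperparameter_tuning": False,
    }


def run_example(train_csv: str, test_csv: str):
    """Generate scripts from a CSV file, then predict on separate rows."""
    # Imported lazily so the helper above can be used without the packages installed.
    import pandas as pd
    from sapientml import SapientML

    sml = SapientML(**build_sapientml_kwargs())
    sml.fit(train_csv, output_dir="./outputs")  # a str is regarded as a file path
    return sml.predict(pd.read_csv(test_csv))   # documented to return a DataFrame
```

Calling run_example("train.csv", "test.csv") with your own files would write the generated scripts to ./outputs and return the predictions; both file names are placeholders.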

class sapientml.GeneratedModel(input_dir: PathLike, save_datasets_format: Literal['csv', 'pickle'], timeout: int, csv_encoding: Literal['UTF-8', 'SJIS'], csv_delimiter: str, params: dict)

The constructor of GeneratedModel. Instantiating this class by yourself is not intended.

Parameters:
  • input_dir (PathLike) – Directory path containing training/prediction scripts and trained models.

  • save_datasets_format ('csv' or 'pickle') – Data format when the input dataframes are written to files. Ignored when all inputs are specified as file path.

  • timeout (int) – Timeout for the execution of training and prediction.

  • csv_encoding ('UTF-8' or 'SJIS') – Encoding method when csv files are involved. Ignored when only pickle files are involved.

  • csv_delimiter (str) – Delimiter to read csv files.

fit(X: DataFrame, y: DataFrame | Series | None = None)

Train the generated model on the input data.

Parameters:
  • X (pandas.DataFrame) – Training dataframe. Contains target values if y is None.

  • y (pandas.DataFrame or pandas.Series) – The target values.

Returns:

self – GeneratedModel object itself.

Return type:

GeneratedModel

predict(X: DataFrame)

Predict the output for X and store it in prediction_result.csv.

Parameters:

X (pd.DataFrame) – Dataframe used for predicting the result.

Returns:

result_df – The contents of prediction_result.csv as a dataframe.

Return type:

pd.DataFrame

save(output_dir: PathLike)

Save generated code to the output_dir folder.

Parameters:

output_dir (Path-like object) – Directory in which the generated code is saved.

Returns:

self – GeneratedModel object itself.

Return type:

GeneratedModel

Config parameters

class sapientml.Config(*, initial_timeout: int = 600, timeout_for_test: int = 0, cancel: CancellationToken | None = None, project_name: str | None = None, debug: bool = False)

Configuration arguments for sapientml.generator.CodeBlockGenerator and/or sapientml.generator.PipelineGenerator.

initial_timeout

Time limit to execute each generated script. Ignored when hyperparameter_tuning=True and hyperparameter_tuning_timeout is set.

Type:

int

timeout_for_test

Time limit to execute the test script (final_script) and visualization.

Type:

int

cancel

Object to interrupt evaluations.

Type:

CancellationToken, optional

project_name

Project name.

Type:

str, optional

debug

Whether to enable debug mode.

Type:

bool

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

__init__ uses __pydantic_self__ instead of the more common self for the first arg to allow self as a field name.

class sapientml_core.SapientMLConfig(*, initial_timeout: int = 600, timeout_for_test: int = 0, cancel: CancellationToken | None = None, project_name: str | None = None, debug: bool = False, n_models: int = 3, seed_for_model: int = 42, id_columns_for_prediction: list[str] | None = None, use_word_list: list[str] | dict[str, list[str]] | None = None, hyperparameter_tuning: bool = False, hyperparameter_tuning_n_trials: int = 10, hyperparameter_tuning_timeout: int = 0, hyperparameter_tuning_random_state: int = 1023, predict_option: Literal['default', 'probability'] = 'default', permutation_importance: bool = True, add_explanation: bool = False)

Configuration arguments for SapientMLGenerator.

n_models

Number of output models to be tried.

Type:

int, default 3

seed_for_model

Random seed for models such as RandomForestClassifier.

Type:

int, default 42

id_columns_for_prediction

Names of the dataframe columns to output together with the prediction result.

Type:

Optional[list[str]], default None

use_word_list

List of words to be used as features when generating explanatory variables from text. If dict type is specified, key must be a column name and value must be a list of words.

Type:

Optional[Union[list[str], dict[str, list[str]]]], default None

hyperparameter_tuning

Whether to enable hyperparameter tuning.

Type:

bool, default False

hyperparameter_tuning_n_trials

The number of trials of hyperparameter tuning.

Type:

int, default 10

hyperparameter_tuning_timeout

Time limit for hyperparameter tuning in each generated script. Ignored when hyperparameter_tuning is False.

Type:

int, default 0

hyperparameter_tuning_random_state

Random seed for hyperparameter tuning.

Type:

int, default 1023

predict_option

Prediction method to use: ‘default’ calls predict(), ‘probability’ calls predict_proba().

Type:

Literal["default", "probability"], default "default"

permutation_importance

Whether to output code that calculates permutation importance.

Type:

bool, default True

add_explanation

If True, outputs ipynb files including EDA and explanation.

Type:

bool, default False

postinit()

Set initial_timeout and hyperparameter_tuning_timeout.

If initial_timeout is None and hyperparameter_tuning is False, initial_timeout is set to INITIAL_TIMEOUT.

If both initial_timeout and hyperparameter_tuning_timeout are None, hyperparameter_tuning_timeout is set to INITIAL_TIMEOUT.

If initial_timeout is set, hyperparameter_tuning is True, and hyperparameter_tuning_timeout is None, hyperparameter_tuning_timeout is set to initial_timeout. Because initial_timeout always expires no later than hyperparameter_tuning_timeout in this case, hyperparameter tuning effectively has no separate time limit during actual execution.
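
The rules above can be sketched as a small standalone function. This is an illustration only, not the library's implementation: resolve_timeouts is a hypothetical name, INITIAL_TIMEOUT = 600 is an assumption based on the documented default of initial_timeout, and each rule is evaluated against the values as originally passed in.

```python
from typing import Optional, Tuple

INITIAL_TIMEOUT = 600  # assumed value, based on the documented default of initial_timeout


def resolve_timeouts(
    initial_timeout: Optional[int],
    hyperparameter_tuning: bool,
    hyperparameter_tuning_timeout: Optional[int],
) -> Tuple[Optional[int], Optional[int]]:
    """Return (initial_timeout, hyperparameter_tuning_timeout) after applying the rules."""
    resolved_initial = initial_timeout
    resolved_tuning = hyperparameter_tuning_timeout

    # Rule 1: no script timeout given and tuning is off -> fall back to the default.
    if initial_timeout is None and not hyperparameter_tuning:
        resolved_initial = INITIAL_TIMEOUT

    if initial_timeout is None and hyperparameter_tuning_timeout is None:
        # Rule 2: neither timeout given -> tuning gets the default.
        resolved_tuning = INITIAL_TIMEOUT
    elif initial_timeout is not None and hyperparameter_tuning and hyperparameter_tuning_timeout is None:
        # Rule 3: the tuning timeout mirrors the script timeout, which expires first anyway.
        resolved_tuning = initial_timeout

    return resolved_initial, resolved_tuning


print(resolve_timeouts(None, False, None))  # (600, 600)
print(resolve_timeouts(None, True, None))   # (None, 600)
print(resolve_timeouts(300, True, None))    # (300, 300)
print(resolve_timeouts(300, True, 60))      # (300, 60)
```

An explicitly set hyperparameter_tuning_timeout is never overwritten in this sketch, which matches the wording that the fallbacks apply only when the value is None.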

class sapientml_preprocess.PreprocessConfig(*, initial_timeout: int = 600, timeout_for_test: int = 0, cancel: CancellationToken | None = None, project_name: str | None = None, debug: bool = False, use_pos_list: list[str] | None = ['名詞', '動詞', '助動詞', '形容詞', '副詞'], use_word_stemming: bool = True)

Configuration arguments for sapientml_preprocess.Preprocess class.

use_pos_list

List of parts of speech to be used during text analysis. This variable is used for Japanese text analysis. Choose from the following parts of speech: “名詞”, “動詞”, “形容詞”, “形容動詞”, “副詞”.

Type:

Optional[list[str]]

use_word_stemming

Whether to use word stemming. This variable is used for Japanese text analysis.

Type:

bool, default True
