API
Main class
- class sapientml.SapientML(target_columns: list[str], task_type: Literal['classification', 'regression'] | None = None, adaptation_metric: str | None = None, split_method: Literal['random', 'time', 'group'] = 'random', split_seed: int = 17, split_train_size: float = 0.75, split_column_name: str | None = None, time_split_num: int = 5, time_split_index: int = 4, split_stratification: bool | None = None, model_type: str = 'sapientml', **kwargs)
The constructor of SapientML.
Keyword arguments that configure the PipelineGenerators and CodeBlockGenerators provided by plugins can also be passed here.
- Parameters:
target_columns (list[str]) – Names of target columns.
task_type ('classification', 'regression' or None) – Task type, either 'classification' or 'regression'.
adaptation_metric (str) – Metric for evaluation. Classification: 'f1', 'auc', 'ROC_AUC', 'accuracy', 'Gini', 'LogLoss', 'MCC' (Matthews correlation coefficient), 'QWK' (quadratic weighted kappa). Regression: 'r2', 'RMSLE', 'RMSE', 'MAE'.
split_method ('random', 'time', or 'group') – Method of train-test split. 'random' splits the rows at random. 'time' requires split_column_name; the rows are sorted by that column and then split. 'group' requires split_column_name; the data is split so that rows sharing the same value of split_column_name never appear in both the training and the test data.
split_seed (int) – Random seed for the train-test split. Ignored when split_method='time'.
split_train_size (float) – Fraction of the input data used for training. Ignored when split_method='time'.
split_column_name (str) – Name of the column used for splitting. Ignored when split_method='random'.
time_split_num (int) – Passed to TimeSeriesSplit as n_splits. Valid only when split_method='time'.
time_split_index (int) – Index of the split to take from TimeSeriesSplit. Valid only when split_method='time'.
split_stratification (bool) – Whether to stratify the train-test split. Valid only when task_type='classification'.
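The 'time' and 'group' split methods described above can be illustrated with a small, dependency-free sketch. This is not SapientML's implementation. `time_split` mirrors the default fold layout of scikit-learn's TimeSeriesSplit (no gap, no maximum training size) and assumes the rows are already sorted by the split column. `group_split` is a hypothetical helper that assigns whole groups to one side, and applying the training fraction to groups rather than rows is an assumption.

```python
import random


def time_split(n_rows, time_split_num=5, time_split_index=4):
    """Return (train_idx, test_idx) for one fold of an expanding-window split.

    Follows the default fold layout of scikit-learn's TimeSeriesSplit;
    rows are assumed to be sorted by time already.
    """
    if not 0 <= time_split_index < time_split_num:
        raise ValueError("time_split_index must be in [0, time_split_num)")
    test_size = n_rows // (time_split_num + 1)
    if test_size == 0:
        raise ValueError("too few rows for the requested number of splits")
    # The last fold's test block ends at the final row; earlier folds shift left.
    test_start = n_rows - (time_split_num - time_split_index) * test_size
    train_idx = list(range(test_start))
    test_idx = list(range(test_start, test_start + test_size))
    return train_idx, test_idx


def group_split(groups, train_size=0.75, seed=17):
    """Split row indices so that no group value lands on both sides.

    Hypothetical helper: the training fraction is applied to groups, not rows.
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    train_groups = set(unique[: int(len(unique) * train_size)])
    train_idx = [i for i, g in enumerate(groups) if g in train_groups]
    test_idx = [i for i, g in enumerate(groups) if g not in train_groups]
    return train_idx, test_idx


# With 12 rows and the default time_split_num=5, each test block has 2 rows.
last_train, last_test = time_split(12, time_split_num=5, time_split_index=4)
first_train, first_test = time_split(12, time_split_num=5, time_split_index=0)

# Rows that share a group value stay together.
example_groups = ["a", "a", "b", "c", "c", "d", "d", "d"]
group_train, group_test = group_split(example_groups)
```

With the defaults time_split_num=5 and time_split_index=4, the sketch selects the last fold, so the most recent block of rows becomes the test data and everything before it is used for training.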
- fit(training_data: DataFrame | str, validation_data: DataFrame | str | None = None, test_data: DataFrame | str | None = None, save_datasets_format: Literal['csv', 'pickle'] = 'pickle', csv_encoding: Literal['UTF-8', 'SJIS'] = 'UTF-8', csv_delimiter: str = ',', ignore_columns: list[str] | None = None, output_dir: str = './outputs', codegen_only: bool = False)
Generate ML scripts for input data.
- Parameters:
training_data (pandas.DataFrame or str) – Training dataframe. When str, this is regarded as a file path.
validation_data (pandas.DataFrame, str or None) – Validation dataframe. When str, this is regarded as a file path. When None, validation data is split off from the training data.
test_data (pandas.DataFrame, str, or None) – Test dataframe. When str, this is regarded as a file path. When None, test data is split off from the training data.
save_datasets_format ('csv' or 'pickle') – Data format used when the input dataframes are written to files. Ignored when all inputs are given as file paths.
csv_encoding ('UTF-8' or 'SJIS') – Encoding used when csv files are involved. Ignored when only pickle files are involved.
csv_delimiter (str) – Delimiter to read csv files.
ignore_columns (list[str]) – Names of columns that must not be used; they are dropped.
output_dir (str) – Output directory.
codegen_only (bool) – If True, only generate code and skip fit() of GeneratedModel.
- Returns:
self – SapientML object itself.
- Return type:
SapientML
- static from_pretrained(model)
Factory method that creates a SapientML instance from a pretrained model built with source code previously generated by SapientML.
model must be a pickle filename, a pickle bytes-like object, or a deserialized object.
- Parameters:
model (str, bytes-like object, or GeneratedModel) – A pretrained model built by source code previously generated by SapientML.
- Returns:
sml – A new SapientML instance loaded from the pretrained model.
- Return type:
SapientML
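The three accepted forms of model (pickle filename, pickle bytes-like object, deserialized object) suggest a type dispatch along the following lines. This is a minimal sketch using only the standard library, not the actual body of from_pretrained. The function name `load_model_object` and the dictionary standing in for a GeneratedModel are hypothetical.

```python
import os
import pickle
import tempfile


def load_model_object(model):
    """Resolve `model` to a deserialized object.

    Accepts a pickle file name (str), a pickle payload (bytes-like),
    or an already deserialized object, which is returned unchanged.
    """
    if isinstance(model, str):
        with open(model, "rb") as f:
            return pickle.load(f)
    if isinstance(model, (bytes, bytearray, memoryview)):
        return pickle.loads(bytes(model))
    return model


# A plain dict stands in for a GeneratedModel in this demonstration.
stand_in = {"name": "stand-in for a GeneratedModel"}
payload = pickle.dumps(stand_in)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        f.write(payload)
    from_file = load_model_object(path)

from_bytes = load_model_object(payload)
from_object = load_model_object(stand_in)
```

Unpickling executes code embedded in the payload, so a pretrained model should only be loaded from a source you trust.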
- predict(test_data: DataFrame)
Predicts the output for test_data.
- Parameters:
test_data (pd.DataFrame) – Dataframe used for predicting the result.
- Returns:
result – The contents of prediction_result.csv as a dataframe.
- Return type:
pd.DataFrame
- class sapientml.GeneratedModel(input_dir: PathLike, save_datasets_format: Literal['csv', 'pickle'], timeout: int, csv_encoding: Literal['UTF-8', 'SJIS'], csv_delimiter: str, params: dict)
The constructor of GeneratedModel. Instantiating this class by yourself is not intended.
- Parameters:
input_dir (PathLike) – Directory path containing training/prediction scripts and trained models.
save_datasets_format ('csv' or 'pickle') – Data format used when the input dataframes are written to files. Ignored when all inputs are given as file paths.
timeout (int) – Timeout for the execution of training and prediction.
csv_encoding ('UTF-8' or 'SJIS') – Encoding used when csv files are involved. Ignored when only pickle files are involved.
csv_delimiter (str) – Delimiter to read csv files.
- fit(X: DataFrame, y: DataFrame | Series | None = None)
Generate ML scripts for input data.
- Parameters:
X (pandas.DataFrame) – Training dataframe. Contains target values if y is None.
y (pandas.DataFrame or pandas.Series) – The target values.
- Returns:
self – GeneratedModel object itself.
- Return type:
GeneratedModel
- predict(X: DataFrame)
Predicts the output for X and stores it in prediction_result.csv.
- Parameters:
X (pd.DataFrame) – Dataframe used for predicting the result.
- Returns:
result_df – The contents of prediction_result.csv as a dataframe.
- Return type:
pd.DataFrame
- save(output_dir: PathLike)
Saves the generated code to the output_dir folder.
- Parameters:
output_dir (Path-like object) – Directory in which the generated code is saved.
- Returns:
self – GeneratedModel object itself.
- Return type:
GeneratedModel
Config parameters
- class sapientml.Config(*, initial_timeout: int = 600, timeout_for_test: int = 0, cancel: CancellationToken | None = None, project_name: str | None = None, debug: bool = False)
Configuration arguments for sapientml.generator.CodeBlockGenerator and/or sapientml.generator.PipelineGenerator.
- initial_timeout
Time limit for executing each generated script. Ignored when hyperparameter_tuning=True and hyperparameter_tuning_timeout is set.
- Type:
int
- timeout_for_test
Time limit for executing the test script (final_script) and visualization.
- Type:
int
- cancel
Object to interrupt evaluations.
- Type:
CancellationToken, optional
- project_name
Project name.
- Type:
str, optional
- debug
Whether debug mode is enabled.
- Type:
bool
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
__init__ uses __pydantic_self__ instead of the more common self for the first arg to allow self as a field name.
- class sapientml_core.SapientMLConfig(*, initial_timeout: int = 600, timeout_for_test: int = 0, cancel: CancellationToken | None = None, project_name: str | None = None, debug: bool = False, n_models: int = 3, seed_for_model: int = 42, id_columns_for_prediction: list[str] | None = None, use_word_list: list[str] | dict[str, list[str]] | None = None, hyperparameter_tuning: bool = False, hyperparameter_tuning_n_trials: int = 10, hyperparameter_tuning_timeout: int = 0, hyperparameter_tuning_random_state: int = 1023, predict_option: Literal['default', 'probability'] = 'default', permutation_importance: bool = True, add_explanation: bool = False)
Configuration arguments for SapientMLGenerator.
- n_models
Number of output models to be tried.
- Type:
int, default 3
- seed_for_model
Random seed for models such as RandomForestClassifier.
- Type:
int, default 42
- id_columns_for_prediction
Names of the dataframe columns to include in the prediction result output.
- Type:
Optional[list[str]], default None
- use_word_list
List of words to be used as features when generating explanatory variables from text. If dict type is specified, key must be a column name and value must be a list of words.
- Type:
Optional[Union[list[str], dict[str, list[str]]]], default None
- hyperparameter_tuning
Enables or disables hyperparameter tuning.
- Type:
bool, default False
- hyperparameter_tuning_n_trials
Number of hyperparameter tuning trials.
- Type:
int, default 10
- hyperparameter_tuning_timeout
Time limit for hyperparameter tuning in each generated script. Ignored when hyperparameter_tuning is False.
- Type:
int, default 0
- hyperparameter_tuning_random_state
Random seed for hyperparameter tuning.
- Type:
int, default 1023
- predict_option
Prediction method to use: "default" calls predict(), "probability" calls predict_proba().
- Type:
Literal["default", "probability"], default "default"
- permutation_importance
Enables or disables output of the permutation importance calculation code.
- Type:
bool, default True
- add_explanation
If True, outputs ipynb files including EDA and explanation.
- Type:
bool, default False
- postinit()
Sets initial_timeout and hyperparameter_tuning_timeout.
If initial_timeout is None and hyperparameter_tuning is False, initial_timeout is set to INITIAL_TIMEOUT.
If both initial_timeout and hyperparameter_tuning_timeout are None, hyperparameter_tuning_timeout is set to INITIAL_TIMEOUT.
If initial_timeout is set, hyperparameter_tuning is True, and hyperparameter_tuning_timeout is None, hyperparameter_tuning_timeout is set to the same value (hyperparameter_tuning_timeout = self.initial_timeout). Because initial_timeout always takes effect first, this amounts to placing no separate time limit on hyperparameter tuning during actual execution.
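The rules above can be written out as a small pure function. This is a sketch of one reading of the description, not the library's code. The value 600 for `INITIAL_TIMEOUT` is an assumption taken from the default shown in the class signature, `resolve_timeouts` is a hypothetical name, and each rule is evaluated against the values originally passed in.

```python
INITIAL_TIMEOUT = 600  # assumed value, taken from the default in the signature


def resolve_timeouts(initial_timeout, hyperparameter_tuning, hyperparameter_tuning_timeout):
    """Return (initial_timeout, hyperparameter_tuning_timeout) after applying the rules."""
    resolved_initial = initial_timeout
    resolved_tuning = hyperparameter_tuning_timeout

    # Rule 1: no initial timeout and no tuning -> fall back to the default.
    if initial_timeout is None and not hyperparameter_tuning:
        resolved_initial = INITIAL_TIMEOUT

    # Rule 2: neither timeout given -> the tuning timeout gets the default.
    if initial_timeout is None and hyperparameter_tuning_timeout is None:
        resolved_tuning = INITIAL_TIMEOUT
    # Rule 3: tuning enabled with only initial_timeout given -> reuse it,
    # which leaves tuning without a separate, tighter limit.
    elif (
        initial_timeout is not None
        and hyperparameter_tuning
        and hyperparameter_tuning_timeout is None
    ):
        resolved_tuning = initial_timeout

    return resolved_initial, resolved_tuning


nothing_set = resolve_timeouts(None, False, None)
tuning_only_flag = resolve_timeouts(None, True, None)
initial_with_tuning = resolve_timeouts(120, True, None)
explicit_tuning_timeout = resolve_timeouts(None, True, 30)
```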
- class sapientml_preprocess.PreprocessConfig(*, initial_timeout: int = 600, timeout_for_test: int = 0, cancel: CancellationToken | None = None, project_name: str | None = None, debug: bool = False, use_pos_list: list[str] | None = ['名詞', '動詞', '助動詞', '形容詞', '副詞'], use_word_stemming: bool = True)
Configuration arguments for sapientml_preprocess.Preprocess class.
- use_pos_list
List of parts of speech to be used during text analysis. This variable is used for Japanese text analysis. Select from the following parts of speech: “名詞”, “動詞”, “形容詞”, “形容動詞”, “副詞”.
- Type:
Optional[list[str]]
- use_word_stemming
Specifies whether word stemming is used. This variable is used for Japanese text analysis.
- Type:
bool, default True
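As a rough illustration of what use_pos_list controls, the sketch below filters tokens by their part-of-speech tag. It is not the preprocessing code of sapientml_preprocess. The `(surface, part_of_speech)` tuples are hand-written stand-ins for the output of a Japanese tokenizer, `filter_by_pos` is a hypothetical helper, and treating None as "no filtering" is an assumption.

```python
# Default taken from the PreprocessConfig signature.
DEFAULT_USE_POS_LIST = ("名詞", "動詞", "助動詞", "形容詞", "副詞")


def filter_by_pos(tagged_tokens, use_pos_list=DEFAULT_USE_POS_LIST):
    """Keep the surface forms whose part of speech is listed.

    Assumption: use_pos_list=None disables filtering.
    """
    if use_pos_list is None:
        return [surface for surface, _ in tagged_tokens]
    allowed = set(use_pos_list)
    return [surface for surface, pos in tagged_tokens if pos in allowed]


# Hand-tagged tokens for "猫がゆっくり走る"; the particle "が" is a 助詞.
tagged = [("猫", "名詞"), ("が", "助詞"), ("ゆっくり", "副詞"), ("走る", "動詞")]
kept_default = filter_by_pos(tagged)
kept_nouns = filter_by_pos(tagged, use_pos_list=["名詞"])
```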