7  Predictive modeling

Model

Create and optimize machine learning models for classification and regression tasks.

Calculate

Selecting the type of model to create (classification or regression) toggles the variables available to predict. Use the filter >> selected menu to specify which variables should be included in the model based on criteria generated in other modules (e.g. statistical analysis and feature selection).

This menu is used to specify the training and test data and the model cross-validation parameters. As in model >> tune, the cross-validation parameters can be set manually or automatically. For example, the selections shown in the example will randomly select 70% of the data (samples) to fit the model, which will then be validated on the 30% held-out (test) data. The model will be internally cross-validated by splitting the training data into 7 folds, leaving out each fold in turn during the model fit, and testing performance on the held-out fold. This process will be repeated 3 times, and performance on the training data will be summarized over all the results.
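The split and resampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the split is a plain random one (the application may stratify or sample differently), and only index bookkeeping is shown, not model fitting.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into training and test sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

def repeated_kfold(train_indices, n_folds=7, n_repeats=3, seed=0):
    """Yield (repeat, fit_indices, held_out_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        shuffled = list(train_indices)
        rng.shuffle(shuffled)
        # Deal samples round-robin so fold sizes differ by at most one.
        folds = [shuffled[i::n_folds] for i in range(n_folds)]
        for k, held_out in enumerate(folds):
            fit = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
            yield repeat, fit, held_out

# 100 samples: 70% for training, 30% held out as test data.
train_idx, test_idx = train_test_split_indices(100, train_fraction=0.7)
# 7 folds repeated 3 times gives 21 fit/held-out pairs.
resamples = list(repeated_kfold(train_idx, n_folds=7, n_repeats=3))
print(len(train_idx), len(test_idx), len(resamples))  # 70 30 21
```

Each of the 21 resamples would be used to fit the model on `fit` and score it on `held_out`; the training performance reported is then a summary (for example the mean) over those scores.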

Model methods and performance summary. For example, this shows that three models were fitted: Random Forest (rf), Partial Least Squares projection to latent structures (pls), and radial-kernel Support Vector Machine (svmRadial). The top-performing model (rf) is highlighted in green.

Plot and Explore

Model performance for the training and test data and training time can be compared. The y-axis shows the selected model performance metric and x-axis the training time.

This plot is used to visualize the impact of hyperparameters on model performance.

Identify the proportion of misclassified samples for classification models using a confusion matrix. Optionally show actual counts or percentages for correct and incorrect classifications.
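As a rough illustration of what the confusion matrix summarizes, the sketch below tabulates actual versus predicted classes as counts, converts them to percentages, and reports the misclassified proportion. The function names and the toy labels are hypothetical and are not taken from the application.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return (labels, matrix); rows are actual classes, columns predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix

def as_percent(matrix):
    """Express each cell as a percent of all samples."""
    total = sum(sum(row) for row in matrix)
    return [[100.0 * cell / total for cell in row] for row in matrix]

def misclassified_proportion(matrix):
    """Fraction of samples that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

# Toy labels for illustration only.
actual = ["a", "a", "a", "b", "b", "b", "b", "a"]
predicted = ["a", "a", "b", "b", "b", "a", "b", "a"]

labels, counts = confusion_matrix(actual, predicted)
print(labels, counts)                    # ['a', 'b'] [[3, 1], [1, 3]]
print(as_percent(counts))                # [[37.5, 12.5], [12.5, 37.5]]
print(misclassified_proportion(counts))  # 0.25
```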

Visualize each variable’s importance, or contribution, to the model’s performance. Importance for multiple models is calculated from a weighted metric that combines each model’s performance with each variable’s importance in that model. The plot for multiple models displays each variable’s consensus rank across all models (y-axis) against its actual importance in the single highest-performing model (x-axis).
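The exact weighting used by the application is not documented here, so the following is only one plausible reading: each variable's importance is weighted by the performance of the model it came from, averaged across models, and the variables are then ranked. The function name, the weighting formula, and the toy numbers are all assumptions made for illustration.

```python
def consensus_rank(importance_by_model, performance_by_model):
    """Rank variables by performance-weighted mean importance (assumed scheme)."""
    total_performance = sum(performance_by_model.values())
    variables = sorted({v for imp in importance_by_model.values() for v in imp})
    weighted = {}
    for variable in variables:
        # Better-performing models contribute more to the consensus score.
        score = sum(
            performance_by_model[model] * importance.get(variable, 0.0)
            for model, importance in importance_by_model.items()
        )
        weighted[variable] = score / total_performance
    ordered = sorted(variables, key=lambda v: weighted[v], reverse=True)
    ranks = {variable: rank for rank, variable in enumerate(ordered, start=1)}
    return ranks, weighted

# Hypothetical importances (0-100 scale) and performances for three models.
importance_by_model = {
    "rf": {"x1": 100.0, "x2": 40.0, "x3": 10.0},
    "pls": {"x1": 60.0, "x2": 80.0, "x3": 5.0},
    "svmRadial": {"x1": 70.0, "x2": 50.0, "x3": 20.0},
}
performance_by_model = {"rf": 0.9, "pls": 0.8, "svmRadial": 0.7}

ranks, weighted = consensus_rank(importance_by_model, performance_by_model)
best_model = max(performance_by_model, key=performance_by_model.get)
for variable, rank in ranks.items():
    # Consensus rank (y-axis) and importance in the best single model (x-axis).
    print(variable, rank, importance_by_model[best_model][variable])
```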

Feature selection

Feature selection is used to identify the variables that maximize model performance. Optimal variables are identified using recursive feature elimination, wherein many models are built from progressively smaller subsets of variables and the optimal subset is the one that yields the highest-performing model.
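The general idea of recursive feature elimination can be sketched as below. This is a simplified illustration, not the application's implementation: the function names are hypothetical, the "model" is a toy scoring function standing in for a real fit, and the sketch re-ranks variables after every fit, which is only one of several possible variants.

```python
def recursive_feature_elimination(variables, fit_and_score, sizes):
    """Fit on shrinking subsets; return (best_result, all_results).

    fit_and_score(subset) must return (performance, {variable: importance}).
    Each result is (subset_size, performance, subset).
    """
    current = list(variables)
    performance, importance = fit_and_score(current)
    results = [(len(current), performance, list(current))]
    smaller_sizes = sorted({s for s in sizes if 0 < s < len(variables)}, reverse=True)
    for size in smaller_sizes:
        # Keep the most important variables from the previous fit.
        current = sorted(current, key=lambda v: importance[v], reverse=True)[:size]
        performance, importance = fit_and_score(current)
        results.append((size, performance, list(current)))
    best = max(results, key=lambda r: r[1])
    return best, results

# Toy stand-in for model fitting: x1 and x2 carry signal, the rest are noise.
SIGNAL = {"x1": 0.5, "x2": 0.3}

def toy_fit_and_score(subset):
    n_noise = sum(1 for v in subset if v not in SIGNAL)
    performance = sum(SIGNAL.get(v, 0.0) for v in subset) - 0.05 * n_noise
    importance = {v: SIGNAL.get(v, 0.0) for v in subset}
    return performance, importance

best, results = recursive_feature_elimination(
    ["x1", "x2", "x3", "x4", "x5"], toy_fit_and_score, sizes=[4, 3, 2, 1]
)
print(best[0], best[2])  # 2 ['x1', 'x2']
```

In this toy case performance rises as noise variables are dropped and falls once a signal variable is removed, so the two-variable subset is selected.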

Calculate

The data menu is used to specify the model type and select target and predictor variables.

The optimize menu is used to specify the algorithm used for the selection. The metric specifies which performance criterion will be used to identify the optimal subset of variables.

The validate menu is used to specify the model cross-validation parameters and size of the automatic hyperparameter tuning grid.

View feature selection methods and results.

Plot and Explore

This visualization displays model performance (y-axis) for each subset of variables (x-axis). The optimal model is highlighted in red. The plot controls specify which model metric is used for the visualization (use calculate to optimize subsets for that metric). The optimal variables are selected by the subset function. Options include PickSizeBest, which selects the subset that maximized or minimized the chosen performance metric, and PickSizeTolerance, which accepts a model with fewer parameters (variables) that performs slightly worse than the optimal model. The accepted decrease in performance is specified in tolerance as a percent of the metric.
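The difference between the two subset functions can be illustrated with a short sketch. The Python function names and the performance values are hypothetical, and the percent-loss formula is an assumption based on the description above (loss relative to the best value of the metric, which is assumed to be positive).

```python
def pick_size_best(results, maximize=True):
    """Return the subset size with the best metric value."""
    chooser = max if maximize else min
    return chooser(results, key=lambda r: r[1])[0]

def pick_size_tolerance(results, tolerance_percent=1.5, maximize=True):
    """Return the smallest subset size within tolerance of the best value."""
    values = [value for _, value in results]
    best = max(values) if maximize else min(values)

    def percent_loss(value):
        # Loss relative to the best value; assumes best > 0.
        diff = (best - value) if maximize else (value - best)
        return diff / best * 100

    return min(size for size, value in results
               if percent_loss(value) <= tolerance_percent)

# Hypothetical (subset size, accuracy) pairs from a feature selection run.
results = [(1, 0.70), (2, 0.88), (3, 0.90), (4, 0.89), (5, 0.87)]

print(pick_size_best(results))                              # 3
print(pick_size_tolerance(results, tolerance_percent=3.0))  # 2
```

Here three variables give the best accuracy, but a tolerance of 3% accepts the two-variable model because its accuracy is only about 2.2% lower than the best.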

This visualization shows the importance of the selected variables (red) compared to those that were removed (blue).

Add a selected-features filter to the row_metadata, or remove all non-selected variables from the data set (keep selected). A common workflow is feature selection followed by training, wherein the rfe_selected filter is used to select variables in the model >> data >> filter >> selected menu.