Data Science Toolkit Random Forest User Guide: How to set up Random Forest : Ruths.ai Product Support

The purpose of this article is to explain how to set up Random Forest.

Purpose

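The Random Forest model is a form of multivariate analysis. It uses an ensemble learning method for classification and regression. The model constructs a "forest" of decision trees, each trained on a random sample of the data, and combines their predictions. Because the result is averaged over many trees, it is generally less prone to overfitting than a single decision tree.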

Limitations:

Columns selected for the model cannot contain nulls, unless you check the option to impute missing values
Does not work with the Analytics Model panel in Spotfire
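The null restriction above can be checked before running the tool. The following is a minimal sketch in plain Python, not the toolkit's own code: it lists the selected columns that contain nulls and fills them with the column median. The article does not say which imputation method the toolkit applies, so median imputation is an assumption here, and the column names are hypothetical.

```python
from statistics import median

def find_columns_with_nulls(rows, columns):
    """Return the selected columns that contain at least one None value."""
    return [c for c in columns if any(row[c] is None for row in rows)]

def impute_median(rows, columns):
    """Return copies of the rows with None replaced by the column median.

    Assumption: median imputation. The toolkit's actual method is not
    documented in this article.
    """
    medians = {}
    for c in columns:
        present = [row[c] for row in rows if row[c] is not None]
        medians[c] = median(present)  # raises StatisticsError if a column is all None
    return [
        {**row, **{c: medians[c] for c in columns if row[c] is None}}
        for row in rows
    ]

# Hypothetical data: two predictor columns, each with one missing value
rows = [
    {"depth": 100.0, "porosity": 0.12},
    {"depth": None, "porosity": 0.15},
    {"depth": 300.0, "porosity": None},
]
selected = ["depth", "porosity"]
null_columns = find_columns_with_nulls(rows, selected)
imputed = impute_median(rows, selected)
```

If `null_columns` is not empty, either check the imputation option in the tool or clean those columns first.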

Prep:

First run data description to understand underlying data distributions
Investigate collinearity of input variables using PCA
Append the rows you want to predict on to the input data set, leaving the response column empty for those rows
- This technique can also be used to make a hold-out set for validation: blank the response for rows whose actual values you know, then compare the predictions against them
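As an illustration of the data layout described above, here is a small sketch in plain Python. It blanks the response for chosen rows to form a hold-out set and then separates rows with a known response from rows to be predicted. The function names and the `depth` and `rate` columns are hypothetical, and how the toolkit itself handles this split is not described in the article.

```python
def split_by_response(rows, response):
    """Separate rows with a known response from rows to be predicted."""
    training = [row for row in rows if row[response] is not None]
    to_predict = [row for row in rows if row[response] is None]
    return training, to_predict

def make_holdout(rows, response, holdout_indices):
    """Blank the response for the chosen rows and keep the true values aside."""
    held_out_truth = {i: rows[i][response] for i in holdout_indices}
    masked = [
        {**row, response: None} if i in held_out_truth else dict(row)
        for i, row in enumerate(rows)
    ]
    return masked, held_out_truth

# Hypothetical table: three rows with a known response, one new row to predict
rows = [
    {"depth": 100.0, "rate": 50.0},
    {"depth": 200.0, "rate": 65.0},
    {"depth": 300.0, "rate": 80.0},
    {"depth": 400.0, "rate": None},
]
masked, truth = make_holdout(rows, "rate", holdout_indices=[1])
training, to_predict = split_by_response(masked, "rate")
```

After the model runs, the values kept in `truth` can be compared with the predictions for the same rows.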

Steps to Run:

Go to the Tools menu
Select the Data Science option
Select the Random Forest option
Enter the data table, response column, predictor columns, number of trees, imputation choice and name.

Random Forest Inputs

Data table
Response column (dependent variable)
Predictor columns (independent variables used to predict the dependent variable)
Number of trees
Checkbox to impute missing values (yes/no)
Name
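The inputs listed above can be pictured as a single record. The sketch below is a hypothetical representation in plain Python, not the toolkit's interface: the class name, field names, and validation rules are all assumptions chosen to mirror the dialog.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RandomForestInputs:
    """Hypothetical container mirroring the Random Forest dialog inputs."""
    data_table: str
    response_column: str
    predictor_columns: tuple
    number_of_trees: int
    impute_missing: bool
    name: str

    def __post_init__(self):
        # These checks are assumptions, not documented toolkit behavior
        if not self.predictor_columns:
            raise ValueError("select at least one predictor column")
        if self.response_column in self.predictor_columns:
            raise ValueError("response column cannot also be a predictor")
        if self.number_of_trees < 1:
            raise ValueError("number of trees must be a positive integer")

inputs = RandomForestInputs(
    data_table="Wells",
    response_column="rate",
    predictor_columns=("depth", "porosity"),
    number_of_trees=500,
    impute_missing=True,
    name="rf_rate",
)
```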

Outputs:

There are 6 outputs.
The OOB Error Rate - Number of Trees line chart shows how the out-of-bag (OOB) error falls as more trees are added to the ensemble
The Definitions text area provides an explanation of the error rate and % variance.
The Training Set Stats table shows a printout of summary information on the model parameters and goodness of fit
The Importance per Variable bar chart shows which variables in the model were the most important.
The Actual vs Predicted scatter plot shows the values of the predicted response variable versus the actual values of the response variable in the data set.
The Predicted Response vs Predictor Column scatter plot shows how the predicted response varies across the values of each predictor variable in the model.
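To make the OOB (out-of-bag) error in the first output more concrete: in a random forest, each tree is typically trained on a bootstrap sample, meaning rows drawn with replacement, and the rows a tree never saw are its out-of-bag rows. The OOB error is measured by predicting each row using only the trees that did not train on it. The sketch below, in plain Python, shows only the sampling step. It is an illustration of the general technique, not the toolkit's implementation, and the row and tree counts are arbitrary.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's training sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def out_of_bag(n_rows, sample):
    """Row indices that the bootstrap sample never picked."""
    return sorted(set(range(n_rows)) - set(sample))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows, n_trees = 1000, 50
oob_fractions = []
for _ in range(n_trees):
    sample = bootstrap_sample(n_rows, rng)
    oob_fractions.append(len(out_of_bag(n_rows, sample)) / n_rows)

# Average share of rows left out of each tree's training sample
mean_oob_fraction = sum(oob_fractions) / n_trees
```

On average each tree leaves out a little over a third of the rows, since the chance that a given row is never drawn is (1 - 1/n)^n, which approaches about 0.368 for large n. This is why the OOB error can serve as a built-in validation estimate without a separate test set.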

Note: The predictions are added as a new column in the original data table.

Try using other visualizations and coloring by the predicted results.

Interpretation

In the Actual vs Predicted scatter plot, the closer the markers are to the straight line fit, the more accurate the model.
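The visual check on the Actual vs Predicted plot can be backed by a number. The sketch below, in plain Python, computes two common summaries from paired actual and predicted values: root mean squared error and R squared. It assumes the reference line is the 1:1 line where predicted equals actual. The values are made up for illustration, and the statistics the toolkit reports may be computed differently.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction miss."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```

A lower RMSE and an R squared closer to 1 correspond to markers sitting closer to the line.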

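Variable importance provides a relative ranking of the predictor variables. It shows which variables the model relies on most, not the direction of their effect on the response.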
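One common way to obtain such a ranking is permutation importance: shuffle one predictor's values across rows and measure how much the prediction error grows. The article does not state which importance measure the toolkit uses, so the plain Python sketch below illustrates the general idea rather than the toolkit's calculation. `toy_model` is a hypothetical stand-in for a fitted forest, and the columns are made up.

```python
import random

def mean_squared_error(model, rows, response):
    return sum((model(row) - row[response]) ** 2 for row in rows) / len(rows)

def permutation_importance(model, rows, response, predictors, rng):
    """Increase in MSE when each predictor's values are shuffled across rows."""
    baseline = mean_squared_error(model, rows, response)
    importance = {}
    for column in predictors:
        shuffled_values = [row[column] for row in rows]
        rng.shuffle(shuffled_values)
        shuffled_rows = [
            {**row, column: value} for row, value in zip(rows, shuffled_values)
        ]
        importance[column] = (
            mean_squared_error(model, shuffled_rows, response) - baseline
        )
    return importance

def toy_model(row):
    """Hypothetical fitted model: strong effect of depth, weak effect of porosity."""
    return 0.2 * row["depth"] + 1.0 * row["porosity"]

rng = random.Random(0)
rows = [{"depth": float(d), "porosity": float(d % 3)} for d in range(0, 200, 10)]
for row in rows:
    row["rate"] = toy_model(row)  # response equals the model output, so baseline MSE is 0

importance = permutation_importance(toy_model, rows, "rate", ["depth", "porosity"], rng)
ranking = sorted(importance, key=importance.get, reverse=True)
```

Here `ranking` lists the predictors from most to least important. Only the order and relative sizes are meaningful, which matches the relative nature of the ranking described above.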

How to filter a subset of data

Open the filter panel by clicking the filter icon on the top bar.
Choose the correct filtering scheme
Click "Refresh data table" on the visualization.
- This syncs all the other Random Forest visualizations with the filtered data.

See the RAI Random Forest video: Data Science Toolkit: Random Forest, from Ruths.ai on Vimeo.

For additional information, see the RAI Data Science Toolkit documentation.
