This article explains how to set up and run the Random Forest tool.
Purpose
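The Random Forest model is a form of multivariate analysis. It uses an ensemble learning method for classification and regression: the model constructs a "forest" of decision trees, each trained on a random sample of the data, and combines their predictions. Because the errors of individual trees tend to average out, the model is generally less prone to overfitting than a single decision tree.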
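To illustrate the ensemble idea, here is a minimal Python sketch. It is not the toolkit's implementation. It assumes a single numeric predictor and uses one-split trees ("stumps"); the names `fit_stump`, `fit_forest`, and `predict_forest` are made up for this example. A full Random Forest also grows deeper trees and considers a random subset of predictors at each split.

```python
import random
from statistics import mean

def fit_stump(rows):
    """Fit a one-split regression tree to (x, y) pairs.

    Returns (threshold, left_mean, right_mean).
    """
    overall = mean(y for _, y in rows)
    # Fallback when every x is identical: no split, always predict the mean.
    best = (float("inf"), float("inf"), overall, overall)
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_forest(rows, n_trees, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Average the predictions of all stumps (regression)."""
    return mean(left if x <= threshold else right
                for threshold, left, right in forest)

# Toy data (hypothetical): the response steps from 0 to 10 at x = 5.
rows = [(x, 0.0 if x < 5 else 10.0) for x in range(10)]
forest = fit_forest(rows, n_trees=25)
# A low x should predict near 0 and a high x near 10.
print(predict_forest(forest, 1), predict_forest(forest, 9))
```

For classification, the averaging step would be replaced by a majority vote across trees.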
Limitations:
- Columns selected for the model cannot contain nulls unless you enable imputation
- Does not work with the Analytics Model panel in Spotfire
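The article does not say which imputation method the tool applies. As an illustration only, the sketch below assumes median imputation of numeric columns; `impute_median` and the column names `x1` and `x2` are hypothetical.

```python
from statistics import median

def impute_median(rows, columns):
    """Return copies of the rows with None replaced by the column median.

    Raises statistics.StatisticsError if a column has no non-null values.
    """
    medians = {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in columns
    }
    return [
        {**r, **{col: medians[col] for col in columns if r[col] is None}}
        for r in rows
    ]

# Hypothetical input with one null in each predictor column.
rows = [
    {"x1": 100.0, "x2": 0.12},
    {"x1": None, "x2": 0.18},
    {"x1": 300.0, "x2": None},
]
filled = impute_median(rows, ["x1", "x2"])
print(filled)
```

The original rows are left unchanged, so the imputed values can be compared against the raw data before modeling.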
Prep:
- First run data description to understand underlying data distributions
- Investigate collinearity of input variables using PCA
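- Append the rows you want to predict on to the input data set, leaving the response column empty (missing) for those rows
- The same technique can be used to make a hold-out set for validation: blank the response for some rows with known values, then compare the predictions to those known values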
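The hold-out technique described in the last item could be prepared outside Spotfire along these lines. This is a sketch under assumptions: rows are represented as dictionaries, and `make_holdout`, the 20% fraction, and the column names are hypothetical choices for the example.

```python
import random

def make_holdout(rows, response, fraction=0.2, seed=0):
    """Blank the response for a random subset of rows.

    Returns (masked_rows, truth), where truth maps row index to the
    original response value that was hidden from the model.
    """
    rng = random.Random(seed)
    n_holdout = max(1, round(len(rows) * fraction))
    holdout_idx = set(rng.sample(range(len(rows)), n_holdout))
    masked, truth = [], {}
    for i, row in enumerate(rows):
        if i in holdout_idx:
            truth[i] = row[response]
            masked.append({**row, response: None})  # model must predict this row
        else:
            masked.append(dict(row))
    return masked, truth

# Hypothetical data set with a known response for every row.
data = [{"x1": i, "y": 2.0 * i} for i in range(10)]
masked, truth = make_holdout(data, response="y")
print(sorted(truth))  # indexes of the rows held out for validation
```

After the model runs, the predictions for the held-out rows can be compared with the values saved in `truth`.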
Steps to Run:
- Go to the Tools menu
- Select the Data Science option
- Select the Random Forest option
- Enter the data table, response column, predictor columns, number of trees, imputation choice and name.
Random Forest Inputs
- Data table
- Response column (dependent variable)
- Predictor columns (independent variables used to predict the dependent variable)
- Number of trees
- Checkbox to impute missing values (yes/no)
- Name
Outputs:
- There are 6 outputs.
- The OOB Error Rate - Number of Trees line chart shows how the out-of-bag (OOB) error changes, typically decreasing, as more trees are added to the ensemble
- The Definitions text area provides an explanation of the error rate and % variance.
- The Training Set Stats table shows a printout of summary information on the model parameters and goodness of fit.
- The Importance per Variable bar chart shows which variables in the model were the most important.
- The Actual vs Predicted scatter plot shows the values of the predicted response variable versus the actual values of the response variable in the data set.
- The predicted response vs predictor column scatter plot shows how closely the model predicted the response column based on the unique values of all the variables in the model.
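The article does not state how the Importance per Variable values are computed. One common approach is permutation importance: shuffle one predictor at a time and measure how much the prediction error grows. The sketch below illustrates that idea only; `permutation_importance`, the stand-in `predict` function, and the column names are assumptions, not the tool's method.

```python
import random

def mse(predict, rows, response):
    """Mean squared error of predict(row) against the response column."""
    return sum((predict(r) - r[response]) ** 2 for r in rows) / len(rows)

def permutation_importance(predict, rows, response, predictors, seed=0):
    """Increase in MSE when each predictor column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, response)
    importance = {}
    for col in predictors:
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, col: v} for r, v in zip(rows, shuffled)]
        importance[col] = mse(predict, permuted, response) - baseline
    return importance

# Hypothetical data: the response depends on "a" only, never on "b".
rows = [{"a": i, "b": (i * 7) % 5, "y": 2 * i} for i in range(20)]

def predict(row):
    """Stand-in for a fitted model."""
    return 2 * row["a"]

importance = permutation_importance(predict, rows, "y", ["a", "b"])
print(importance)
```

A predictor the model ignores gets an importance of zero, while shuffling a predictor the model relies on raises the error.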
Note: The new predictions are added as a column in the original data table.
Try other visualizations and color them by the predicted results.
Interpretation
- In the Actual vs Predicted scatter plot, the closer the markers are to the straight line fit, the more accurate the model.
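The visual check can be backed by numbers. The sketch below computes the root mean squared error and the fraction of variance explained for paired actual and predicted values. It measures distance from perfect prediction (predicted equals actual); whether this matches the % variance reported by the tool is an assumption. The name `fit_quality` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean

def fit_quality(actual, predicted):
    """Return RMSE and R-squared for paired actual and predicted values.

    R-squared is undefined (division by zero) if all actual values are equal.
    """
    actual_mean = mean(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - actual_mean) ** 2 for a in actual)
    return {
        "rmse": sqrt(residual_ss / len(actual)),
        "r_squared": 1 - residual_ss / total_ss,
    }

# Hypothetical values read from the Actual vs Predicted plot.
actual = [1.0, 2.0, 3.0, 4.0]
perfect = fit_quality(actual, actual)
mean_only = fit_quality(actual, [2.5, 2.5, 2.5, 2.5])
print(perfect, mean_only)
```

An R-squared near 1 corresponds to markers lying close to the line, while a value near 0 means the model does no better than predicting the mean.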
How to filter a subset of data
- Open the filter panel by clicking the filter icon on the top bar.
- Choose the correct filtering scheme.
- Click "Refresh data table" on the visualization.
- This syncs the filter with all the other Random Forest visualizations.
See the RAI Random Forest video, "Data Science Toolkit: Random Forest", from Ruths.ai on Vimeo.
For additional information, see the RAI Data Science Toolkit documentation.