Purpose: Principal component analysis (PCA) converts a set of possibly correlated columns into a set of uncorrelated variables called principal components.  Combining multiple columns into principal components can shorten a model's run time, reduce noise, or otherwise improve the model.  PCA is one step in the predictive modeling process.
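
As an illustration of what "uncorrelated" means here, the following is a minimal Python sketch using NumPy; it is not the tool's own implementation. The function name `pca`, the sample data, and the choice to center the columns without scaling them are all assumptions made for the example.

```python
import numpy as np

def pca(X):
    """Return (scores, rotation, variances) for a numeric 2-D array X.

    ASSUMPTION: columns are centered but not scaled; the tool's own
    preprocessing is not documented here.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)               # center each column
    cov = np.cov(centered, rowvar=False)        # covariance between columns
    variances, rotation = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(variances)[::-1]         # reorder: largest variance first
    variances, rotation = variances[order], rotation[:, order]
    scores = centered @ rotation                # data expressed in PC coordinates
    return scores, rotation, variances

# Hypothetical data: two strongly correlated columns.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200)])

scores, rotation, variances = pca(X)
corr_before = np.corrcoef(X, rowvar=False)[0, 1]       # close to 1
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]   # close to 0
```

The original columns are almost perfectly correlated, while the principal component scores are uncorrelated up to floating-point error.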


Information on PCA:


Inputs: 

  1. Data table
  2. Columns that would go into a predictive model (remember that PCA will attempt to combine them)
  3. Tolerance - a cutoff from 0 to 1, relative to the variance of the first PC.  Components that fall below the cutoff are dropped, which reduces the number of PCs returned.  Leave blank to not enforce a tolerance cutoff.
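
One plausible reading of the tolerance input can be sketched in Python as follows. The exact rule the tool applies is not documented here, so the comparison used (keep a component only if its variance exceeds the tolerance times the variance of the first PC) is an assumption, and `apply_tolerance` is a hypothetical helper name.

```python
def apply_tolerance(variances, tol=None):
    """Return the component variances that survive the tolerance cutoff.

    ASSUMPTION: `variances` is sorted from largest to smallest, and a
    component is kept only if its variance is greater than
    tol * (variance of the first component).
    """
    variances = list(variances)
    if tol is None or not variances:
        return variances                      # blank tolerance: keep everything
    if not 0 <= tol <= 1:
        raise ValueError("tol must be between 0 and 1")
    cutoff = tol * variances[0]
    return [v for v in variances if v > cutoff]

# Hypothetical component variances.
kept = apply_tolerance([4.0, 1.0, 0.2, 0.01], tol=0.1)
```

With a tolerance of 0.1 the cutoff is 0.4, so only the first two components are kept; leaving the tolerance blank (`None`) keeps all four.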


Limitations: 

  • Runs only on complete cases (rows with no missing values), so it is very sensitive to missing data
  • Runs only on numeric data, so categorical variables must be converted to binary columns


Prep:

  • Convert categorical variables to binary columns
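
The preparation described above can be sketched in plain Python. This is an illustration only: the helper names `complete_cases` and `to_binary_columns`, the column names, and the sample rows are hypothetical, and missing values are assumed to be represented as `None`.

```python
def complete_cases(rows):
    """Drop every row that contains a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def to_binary_columns(rows, column):
    """Replace one categorical column with a 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded

# Hypothetical input table: one row has a missing weight.
rows = [
    {"height": 1.8, "weight": 80.0, "color": "red"},
    {"height": 1.6, "weight": None, "color": "blue"},
    {"height": 1.7, "weight": 65.0, "color": "blue"},
]
prepared = to_binary_columns(complete_cases(rows), "color")
```

The row with the missing weight is dropped, and the remaining rows carry numeric `color_blue` and `color_red` columns in place of `color`.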


Steps to Run:

  1. Go to the Tools menu
  2. Select the Data Science option
  3. Select the PCA option
  4. Enter the data table, columns to be included and a tolerance if desired


Outputs: 

  There are four outputs:

  • PC Values on Original Data scatter plot
  • PCA Rotation Matrix table
  • Variance explained by each PC bar chart, with one bar for each principal component
  • PCA Plot for top components scatter plot
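
To make the variance-explained bar chart concrete, here is a small Python sketch of how the bar heights are conventionally computed: each component's variance divided by the total. The function name and the sample variances are hypothetical, and the tool's exact calculation is assumed rather than confirmed.

```python
from itertools import accumulate

def variance_explained(variances):
    """Return each component's share of the total variance."""
    total = sum(variances)
    return [v / total for v in variances]

# Hypothetical component variances, largest first.
shares = variance_explained([4.0, 2.0, 1.0, 1.0])
cumulative = list(accumulate(shares))   # running total across components
```

Here the first component explains half of the total variance, and the first two together explain three quarters, which is the kind of reading that helps decide how many components to keep.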


Interpretation: 


Data Science Toolkit: PCA from Ruths.ai on Vimeo.