This article explains how to set up PCA (Principal Component Analysis).



Purpose: PCA converts a set of possibly correlated columns into a set of uncorrelated variables called principal components (PCs).  Combining multiple columns into principal components can speed up a model's run time, reduce noise, or otherwise optimize the model.  PCA is one step in the predictive modeling process.
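
To illustrate the idea of turning correlated columns into uncorrelated components, here is a minimal sketch in Python with NumPy. It is not the tool's implementation: the data is synthetic, the columns are only centered (whether the tool also scales them is not stated here), and all variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated columns plus one independent column.
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2.0 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Center each column, then take the SVD; the rows of vt are the principal directions.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Project the centered data onto the principal directions to get the PC values.
scores = centered @ vt.T

# The first two original columns are highly correlated ...
original_corr = np.corrcoef(data, rowvar=False)
# ... while the covariance between different PCs is zero up to rounding error.
pc_cov = np.cov(scores, rowvar=False)
off_diagonal = pc_cov - np.diag(np.diag(pc_cov))

print("correlation of columns 1 and 2:", original_corr[0, 1])
print("largest covariance between PCs:", np.abs(off_diagonal).max())
```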



Prep:

  • Convert categorical variables to binary columns


Limitations: 

  • Only runs on complete-case data, so it is very sensitive to missing data
  • Only runs on numeric data, so categorical variables need to be converted to binary columns
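
If the data is prepared outside the tool, the two limitations above could be handled roughly as follows. This is only a sketch using pandas; the table, its column names and its values are hypothetical.

```python
import pandas as pd

# Hypothetical table: column names and values are made up for illustration.
table = pd.DataFrame({
    "pressure": [1200.0, 1350.0, None, 1500.0],
    "rate": [35.0, 42.0, 38.0, 40.0],
    "region": ["A", "B", "A", "C"],
})

# Expand the categorical column into one binary (0/1) column per category.
encoded = pd.get_dummies(table, columns=["region"], dtype=int)

# Keep only complete cases: any row with a missing value is dropped.
complete = encoded.dropna()

print(complete)
```

Note that dropping incomplete rows can remove a large share of the table when missing values are spread across many columns, which is why PCA is described as sensitive to missing data.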



Steps to Run:

  1. Go to the Tools menu
  2. Select the Data Science option
  3. Select the PCA option
  4. Enter the data table, columns to be included and a tolerance if desired






Input Options: 

  1. Data table
  2. Columns that would be put into a predictive model (remember that PCA will attempt to combine them)
  3. Option to Add All columns or Clear All checked columns.
  4. Tolerance - a cutoff (from 0 to 1) relative to the variance of the first PC; it reduces the number of PCs returned.  Leave blank to not enforce a tolerance cutoff.
  5. Click OK when you have finished your selections.
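
The tolerance option can be pictured with a small sketch. The exact rule the tool applies is not documented in this article, so the function below is an assumption: it keeps a PC only if its variance exceeds the tolerance times the variance of the first PC. (For comparison, the `tol` argument of R's `prcomp` applies a similar relative cutoff to standard deviations rather than variances.) The function name and the sample variances are hypothetical.

```python
import numpy as np

def keep_components(variances, tolerance=None):
    """Return the indices of the PCs kept under an assumed relative cutoff.

    variances must be sorted from largest to smallest, as PCA returns them.
    """
    variances = np.asarray(variances, dtype=float)
    if tolerance is None:
        # A blank tolerance means no cutoff is enforced: keep every PC.
        return np.arange(variances.size)
    # Assumed rule: keep PCs whose variance exceeds tolerance * (variance of first PC).
    return np.flatnonzero(variances > tolerance * variances[0])

# Made-up variances for four PCs.
pc_variances = [4.0, 1.0, 0.3, 0.02]

kept_all = keep_components(pc_variances)
kept_some = keep_components(pc_variances, tolerance=0.1)

print(kept_all)
print(kept_some)
```

Under this assumed rule, a larger tolerance returns fewer PCs, and a tolerance of 0 keeps every PC with nonzero variance.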





Outputs (4 Visualizations): 

  • PC Values on Original Data - Scatter Plot
  • PCA Rotation Matrix table
  • Variance explained by each PC - Bar Chart, with one bar for each principal component
  • PCA Plot for top components - Scatter Plot


- Note: Columns containing the PC values are also appended to the original data table.
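
For readers who want to see where these outputs come from, the quantities behind them can be sketched with NumPy. This is an illustration on synthetic data with made-up variable names, not the tool's code, and it assumes the columns are centered but not scaled.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic table with 50 rows and 3 numeric columns.
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
data = rng.normal(size=(50, 3)) @ mixing

centered = data - data.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)

# Rotation matrix: one row per input column, one column per PC.
rotation = vt.T

# PC values on the original data: one row per data row, one column per PC.
scores = centered @ rotation

# Share of the total variance explained by each PC (the bar chart heights).
variance = s**2 / (len(data) - 1)
explained = variance / variance.sum()

# Coordinates for a scatter plot of the top two components.
top_two = scores[:, :2]

# Original columns with the PC columns appended, as in the note above.
augmented = np.hstack([data, scores])

print(np.round(explained, 3))
```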



Example:




How to filter a subset of the data:

  1.  Open the filter panel by clicking the filter icon on the top bar 
  2.  Choose the correct filtering scheme
  3.  Click the "Refresh data table" icon on the PC Values on Original Data visualization.
    - Note: This also syncs all the other PCA visualizations.





For additional information, watch the video: 


Data Science Toolkit: PCA from Ruths.ai on Vimeo.