This article explains Data Description in the Data Science Toolkit.
What is Data Description in Data Science Toolkit?
Purpose
The purpose of this extension is to provide the user with a statistical summary of their data that can be used to determine what data preparation tasks are necessary for further analysis.
Outputs: (No Transformation)
- The output is a single table with 17 columns that summarize the statistics and distribution characteristics of each numeric column.
- The first column (column) specifies the column being analyzed.
- The second column (count) gives the number of records. Because nulls are not excluded, this number should be the same for every column.
- The third column (type) lists the data type of the specified column.
- The fourth column (num.na) counts the cells in that column that contain nulls.
- The remaining columns describe the data in terms of min, max, median, and other statistical measures.
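The layout of the summary table can be illustrated with a minimal Python sketch. This is not the extension's implementation: the function name `describe_numeric`, the sample data, the type labels `Integer` and `Real`, and the choice of `min`, `max`, `median`, and `mean` as the trailing columns are assumptions made for illustration, and only 8 of the 17 columns are reproduced.

```python
import statistics


def describe_numeric(table):
    """Return one summary row (a dict) per numeric column of `table`.

    `table` maps column names to equal-length lists; None stands for a null.
    Hypothetical helper for illustration, not part of the toolkit.
    """
    rows = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        # Skip columns that are empty or not numeric (bool is excluded on purpose).
        if not present or not all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        ):
            continue
        rows.append({
            "column": name,
            "count": len(values),  # nulls are included, so this matches across columns
            "type": "Integer" if all(isinstance(v, int) for v in present) else "Real",
            "num.na": len(values) - len(present),
            "min": min(present),
            "max": max(present),
            "median": statistics.median(present),
            "mean": statistics.fmean(present),
        })
    return rows


# Made-up sample data: two numeric columns with one null each, one text column.
sample = {
    "depth": [1200.0, 1350.5, None, 1410.0],
    "stages": [10, 12, 14, None],
    "well": ["A", "B", "C", "D"],
}
summary = describe_numeric(sample)
for row in summary:
    print(row)
```

In this sketch the text column is left out of the summary, and both numeric rows report the same count even though each contains a null, which mirrors the behavior described for the count and num.na columns.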
Outputs: (With Transformation)
- The output is a single table with 17 columns that summarize the statistics and distribution characteristics of each numeric column (same as above).
- One additional table visualization (5 columns) shows the transformation results for normality.
Note: The results are added as a new data table that can be used for further evaluation.
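To give a sense of what comparing transformations for normality can look like, here is a small Python sketch. The article does not say which transformations or normality measures the extension uses, so everything here is an assumption: the candidate transformations (`none`, `log`, `sqrt`), the use of skewness as the measure, and the names `skewness` and `compare_transforms` are all hypothetical.

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean of cubed z-scores (about 0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


# Hypothetical candidate transformations; log and sqrt need positive inputs.
TRANSFORMS = {
    "none": lambda v: v,
    "log": math.log,
    "sqrt": math.sqrt,
}


def compare_transforms(values):
    """Return {transformation name: skewness after applying it}, ignoring nulls."""
    present = [v for v in values if v is not None]
    return {
        name: skewness([fn(v) for v in present])
        for name, fn in TRANSFORMS.items()
    }


# Made-up right-skewed sample: the values double each step,
# so taking the log spaces them evenly.
rates = [1, 2, 4, 8, 16, 32, 64]
result = compare_transforms(rates)
best = min(result, key=lambda name: abs(result[name]))
print(result)
print("closest to symmetric:", best)
```

A skewness near zero is only a rough indicator of symmetry, not a full normality test, so a sketch like this is best read as an illustration of the comparison rather than a substitute for the extension's output.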
Example: (No Transformation)
Example: (With Transformation)
Data Science Toolkit Data Description User Guide: How to set up Data Description
See the Data Description video, "Data Science Toolkit: Data Description", from Ruths.ai on Vimeo.
For additional information on RAI Data Science Toolkit documentation, click here.