We often get overwhelmed by the many ways to represent data, so here is a basic thought process covering why we visualize and what to keep in mind while doing it.
Exploratory vs Explanatory
Data in its raw format is neither beautiful nor insightful. To get hold of the underlying distribution, anomalies and insights, we need to do exploratory data analysis, popularly known as EDA. So, the exploratory part is about how to draw inferences from the data.
Now, showing the insights in a plain tabular format, without any visual representation, buries the insight, and the end user finds it difficult to interpret. Hence, the explanatory part is about how to tell the story visually once you have the insights.
When should I actually plot?
- Exploratory: I am not able to see the patterns clearly even though I have the numbers.
- Explanatory: I have the numbers and I understand the patterns, but now I want to present them to the audience.
What is Visualisation?
The science of data visualization:
- Anatomy of a plot/ chart.
- Asking the right question before creating a plot.
- How to choose the right plot/chart for the given data.
The art in data visualization:
- Choose the right scale, labels, tick labels.
- Identify and remove clutter in the plot.
- Choose the right font size.
- Ways to highlight information in the plot.
How to choose the right plot?
- Science: How many variables (V) are involved? 1V, 2V, or multiple V.
- Science: For every V, what type of variable is it? Continuous or categorical (ordinal, nominal).
- Art: What do I want to see?
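The first two questions can be answered mechanically. Here is a minimal sketch, assuming each column is a plain Python list: numbers are treated as numerical and everything else as categorical. The function names and the toy data are hypothetical, and the heuristic cannot tell ordinal from nominal, so that call still needs a human.

```python
from numbers import Number


def variable_type(values):
    """Label a column 'numerical' or 'categorical' (a rough heuristic, not a rule)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "numerical" if is_numeric else "categorical"


def describe_columns(columns):
    """Return how many variables there are and the inferred type of each."""
    types = {name: variable_type(vals) for name, vals in columns.items()}
    return len(types), types


# Hypothetical toy data, invented for illustration only.
columns = {
    "bill": [10.5, 23.0, 18.2],
    "tip": [1.5, 3.0, 2.5],
    "day": ["Thu", "Fri", "Sat"],
}

n_vars, types = describe_columns(columns)
print(n_vars, types)
```

A category stored as a number (say, a rating from 1 to 5) would be mislabelled by this sketch, which is why the type question deserves a moment of thought before plotting.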
Types of Analysis
Broadly we can divide our analysis into three parts:
- 1 Variable — Univariate data analysis
- 2 Variables — Bivariate data analysis
- 2+ Variables — Multivariate data analysis (Bivariate is also a part of multivariate)
Now, based on the combinations of variable type and analysis, we can build our thought process for visualisation. So, let's get the ball rolling 😃
A univariate analysis can be done on either a categorical or a numerical variable. Things you may want to know about a categorical variable:
- Frequency of each class in the categorical variable
- Raw frequencies can be on very different scales, so we may also want the relative frequencies (percentages/proportions).
Prefer a bar plot whose bars sit side by side along the horizontal axis, as that makes them easy to compare; lay the bars out vertically instead (one category per row) only when there are too many categories to fit across the screen.
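Both quantities for a categorical variable, the frequency and the relative frequency, take only a few lines to compute. The sketch below uses `collections.Counter` from the standard library; the `days` list is made-up data, and the text bars are only a crude stand-in for a real bar plot.

```python
from collections import Counter


def frequency_table(values):
    """Map each class to (count, proportion), most frequent class first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        label: (count, count / total)
        for label, count in counts.most_common()
    }


# Hypothetical categorical variable, invented for illustration.
days = ["Sat", "Sun", "Sat", "Thu", "Sat", "Sun", "Fri", "Sat"]
table = frequency_table(days)

# One row per category, with a text bar as long as the count.
for label, (count, share) in table.items():
    bar = "#" * count
    print(f"{label:<4}{bar:<6}{count:>2} ({share:.0%})")
```

Sorting by frequency before plotting is a choice made here, not something the data requires; for an ordinal variable you would usually keep the natural order of the classes instead.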
Things you may want to know about a numerical variable:
- Distribution of my data
- Some statistics, aggregates about data
- Check for skewness or look for outliers
- Box plot (five-number summary)
- KDE plot (kernel density estimate of the probability density)
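The numbers behind a box plot can be computed directly. The sketch below derives the five-number summary with `statistics.quantiles` and flags outliers using the 1.5 × IQR rule. That rule is a common box-plot convention and an assumption on my part, not something prescribed above. The `tips` values are made up.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a numerical variable."""
    data = sorted(values)
    # The 'inclusive' method treats the data as the whole population.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Values beyond k * IQR from the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical numerical variable, invented for illustration.
tips = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

print(five_number_summary(tips))
print(iqr_outliers(tips))

# A mean well above the median is a rough hint of right skew.
print(statistics.mean(tips), statistics.median(tips))
```

Different quantile methods give slightly different quartiles on small samples, so a plotting library may draw the box a little differently from these numbers.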
Bivariate (/multivariate) Visualisation
- Many to many (bill vs tip): scatter plot
- One to one (time vs temperature): line plot
- Stacked bar plot (only when the sum of the variables makes sense, like marks in each subject)
- Heat map / contingency matrix (confusion matrix in ML)
- Multiple KDE plots
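A contingency matrix is just a count of how often each pair of categories occurs together, and a heat map is that matrix drawn with colour. Here is a minimal sketch for two categorical variables; the function name and the `actual`/`predicted` labels are hypothetical and mimic a tiny confusion matrix.

```python
from collections import Counter


def contingency_matrix(rows, cols):
    """Count co-occurrences of two categorical variables."""
    pair_counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    # Counter returns 0 for pairs that never occur together.
    matrix = [
        [pair_counts[(r, c)] for c in col_labels] for r in row_labels
    ]
    return row_labels, col_labels, matrix


# Hypothetical labels, invented for illustration.
actual = ["cat", "dog", "cat", "cat", "dog", "dog"]
predicted = ["cat", "dog", "dog", "cat", "dog", "cat"]

row_labels, col_labels, matrix = contingency_matrix(actual, predicted)
for label, row in zip(row_labels, matrix):
    print(label, row)
```

In this layout the rows are the first variable and the columns are the second, so the diagonal holds the cases where the two agree.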
We can make use of color, shape or size to represent the third variable.
- Numerical-numerical-numerical: Scatter plot with third variable as the size of the dots.
- Numerical-numerical-categorical: Scatter plot with colour/ shape as third categorical variable or line-plot with colour.
- Numerical-categorical-categorical: Side-by-side dodged box plots, bar plots, violin plots.
- Categorical-categorical-categorical: Contingency matrix with third variable as colour or clustered stacked bar plot.
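The four combinations above can be kept as a small lookup table. The sketch below is only a way of encoding the list, assuming each variable has already been labelled "numerical" or "categorical"; the names `THREE_VARIABLE_PLOTS` and `suggest_plot` are hypothetical.

```python
from collections import Counter

# Keys are (number of numerical variables, number of categorical variables).
THREE_VARIABLE_PLOTS = {
    (3, 0): "scatter plot with the third variable as dot size",
    (2, 1): "scatter plot with colour/shape for the category, or line plot with colour",
    (1, 2): "side-by-side (dodged) box, bar, or violin plots",
    (0, 3): "contingency matrix with colour, or clustered stacked bar plot",
}


def suggest_plot(types):
    """Suggest a plot for three variables given their types."""
    counts = Counter(types)
    key = (counts["numerical"], counts["categorical"])
    if len(types) != 3 or sum(key) != 3:
        raise ValueError("expected three 'numerical'/'categorical' labels")
    return THREE_VARIABLE_PLOTS[key]


print(suggest_plot(["numerical", "numerical", "categorical"]))
```

The order of the labels does not matter here, since only the count of each type decides the suggestion; which variable goes on which axis is still the "art" part.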