Visualizing Statistical Data

Statistics: Arts and Sciences

Statistics is the subject that encompasses all aspects of learning from data: the tools and methods that let us work with data in order to understand it. Statisticians develop and apply data analysis methods and study the properties of those methods. Researchers and practitioners in other fields apply and extend statistical methodology. Statistics is a dynamic, rapidly evolving field.

There are different schools of thought about the field of statistics:

art of summarizing data	make a data set comprehensible to a human observer; the right summary depends on the goals of the “data consumer”
science of uncertainty	quantify how far our reported findings may fall from “the truth”
science of decisions	decision making as the ultimate goal of any statistical analysis
science of variation	emphasis on understanding variation in data
art of forecasting	make predictions about the future
science of measurement	measure difficult-to-define concepts and assess the quality of measurements
basis for principled data collection	a rational way to manage trade-offs, since data are expensive to collect and resources are limited

Data

Data can be numbers, images, words, or audio. In general, there are two key types of data:

organic / process data	generated organically as the result of some process over time, e.g. financial or point-of-sale transactions, stock market exchanges, etc.
“designed” data collection	designed specifically to address a stated research objective, e.g. individuals sampled from a population.

Big data usually refers to datasets that come from organic processes. Processing these data requires significant computational resources, and data scientists mine them to study trends and uncover interesting relationships.

“Designed” data usually come from sampling a population and administering carefully designed questions. These datasets are generally much smaller than organic or process datasets. They are collected for very specific reasons, rather than being simple reflections of an ongoing natural process.

In the i.i.d. (independent and identically distributed) case, each observation on a variable of interest is completely independent of all other observations, so there is no correlation between the different measurements:

independent	Each observation is independent of all the other observations that make up the data set.
identically distributed	All of the values that we observe arise from one common statistical distribution.

Given i.i.d. data, we can estimate features of that distribution with a certain amount of precision, such as its mean, variance, or extreme percentiles, and make inferences about those features.

When data are not i.i.d., the analysis must account for the dependencies among observations and the differences in their distributions, which may call for different analytic procedures.
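
As a minimal sketch of this idea, the snippet below simulates a hypothetical i.i.d. sample and estimates a few features of its distribution. The normal distribution, its mean of 50 and standard deviation of 10, and the sample size are assumptions made up for illustration; they do not come from the course.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical i.i.d. sample: 1,000 independent draws from one normal
# distribution (mean 50, standard deviation 10 are made-up values).
sample = [random.gauss(50, 10) for _ in range(1000)]

mean = statistics.mean(sample)
variance = statistics.variance(sample)         # sample variance (n - 1 denominator)
p99 = statistics.quantiles(sample, n=100)[98]  # 99th percentile (last of 99 cut points)

# Standard error of the mean: one way to express the precision of the estimate.
std_error = (variance / len(sample)) ** 0.5

print(f"mean = {mean:.2f} (standard error {std_error:.2f})")
print(f"variance = {variance:.2f}, 99th percentile = {p99:.2f}")
```

The standard error formula used here relies on the observations being independent; with dependent data it would generally misstate the precision.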

Variable Types

A quantitative variable is a numerical, measurable quantity for which arithmetic operations usually make sense. A continuous quantitative variable can take on any value within an interval, so it has infinitely many possible values. A discrete quantitative variable takes values from a countable set, such as whole-number counts.

A categorical (or qualitative) variable classifies individuals or items into different groups. An ordinal categorical variable has some order or ranking associated with its groups. A nominal variable has no such ranking.

Study Design

There is a spectrum of study design options, from exploratory data analysis all the way up to highly planned efforts that address a specific question. Types of research studies include:

Exploratory vs. Confirmatory
Comparative vs. Non-comparative
Observational studies vs. Experiments

Confirmatory research follows the scientific method: it specifies a falsifiable hypothesis and then tests it by collecting data to address pre-specified questions. In comparison, exploratory research collects and analyzes data without first pre-specifying questions.

The goal of comparative research is to contrast one quantity with another. Non-comparative studies focus on estimating or predicting absolute quantities and are not explicitly comparative.

Observational data arise naturally. In observational studies, subjects are exposed to a condition rather than being assigned to it; exposure is passive or self-selected, often because it is impractical or unethical to assign subjects. Experiments involve deliberately assigning subjects to treatment arms.

Power analysis is the process of assessing whether a study design is likely to yield meaningful findings. Bias refers to measurements that are systematically off-target, or to samples that are not representative of the population of interest.
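
One way to make power analysis concrete is simulation: generate many hypothetical experiments under an assumed effect and count how often a test detects it. The sketch below is only an illustration under stated assumptions: two groups, normally distributed outcomes with a known standard deviation, a two-sided z-test, and made-up values for the effect size and significance level. The function name `simulated_power` is hypothetical.

```python
import random
from statistics import NormalDist

def simulated_power(n_per_group, effect=0.5, sigma=1.0, alpha=0.05,
                    n_sims=2000, seed=0):
    """Share of simulated two-group experiments whose z-test rejects 'no difference'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    std_error = sigma * (2 / n_per_group) ** 0.5   # known-sigma SE of the mean difference
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sigma) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(diff / std_error) > z_crit:
            rejections += 1
    return rejections / n_sims

for n in (10, 100):
    print(f"n per group = {n}: estimated power = {simulated_power(n):.2f}")
```

Under these assumptions the larger design detects the effect far more often, which is the kind of comparison a power analysis is meant to support before data are collected.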

Univariate Data

Categorical (qualitative) data

The best way of summarizing categorical data is usually just:

a frequency table with either counts or percentages or both.
bar charts are a great method of visualization.
choose pie charts with caution.
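
A frequency table with counts and percentages can be sketched in a few lines. The survey responses below are made-up data for illustration.

```python
from collections import Counter

# Hypothetical survey responses (made-up data for illustration).
responses = ["yes", "no", "yes", "unsure", "yes", "no", "yes", "no"]

counts = Counter(responses)
total = sum(counts.values())

# Frequency table: each category mapped to (count, percentage),
# ordered from most to least common.
table = {
    category: (count, 100 * count / total)
    for category, count in counts.most_common()
}

for category, (count, percent) in table.items():
    print(f"{category:<8}{count:>3}{percent:>8.1f}%")
```

The same counts are what a bar chart of this variable would display, one bar per category.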

Quantitative data

A histogram (for quantitative data) is different from a bar chart (for categorical data). Histograms are a great first analysis of our data: they give a quick view of what the data look like and what we might want to analyze further. There are 4 main aspects of a histogram:

Shape	Overall appearance of histogram, which can be symmetric, bell-shaped, left skewed (mean < median), right skewed (mean > median), etc.
Center	Mean or median.
Spread	How far our data spreads: range, interquartile range (IQR), standard deviation, variance. Range = max - min. Interquartile range = 3rd quartile - 1st quartile. Standard deviation: approximately the average distance of our data points from the mean.
Outliers	Data points that fall far from the bulk of the data.

Recall that the standard deviation is roughly the average distance of our values from the mean. For a normal distribution, about 68% of values are expected to fall within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. This is called the “68-95-99.7 rule” or “empirical rule”. A good way to estimate how unusual a value is, is to calculate its “standard score” or “z-score”.

standard score = ( value - mean ) / standard deviation
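
The standard score formula can be tried out directly. The sketch below uses made-up exam scores, and it also computes the share of a normal distribution that lies within 1, 2, and 3 standard deviations of the mean, using `NormalDist` from Python's standard library.

```python
import statistics
from statistics import NormalDist

# Hypothetical exam scores (made-up data for illustration).
scores = [62, 70, 71, 74, 75, 77, 78, 80, 83, 95]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation

def z_score(value):
    """Number of standard deviations that `value` lies from the mean."""
    return (value - mean) / sd

print(f"mean = {mean:.1f}, sd = {sd:.2f}")
print(f"z-score of 95: {z_score(95):.2f}")

# Share of a normal distribution within k standard deviations of the mean.
std_normal = NormalDist()
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} sd: {share:.1%}")
```

A score with a z-score beyond about 2 in absolute value lies in a region that holds only around 5% of a normal distribution, which is why such values are often treated as unusual.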

Boxplots give a compact graphical picture of the 5-number summary: min, 1st quartile, median, 3rd quartile, max. They also come with a rule for identifying outliers. One drawback is that a boxplot hides features like gaps and clusters in the distribution, but boxplots displayed side by side are very useful for making comparisons.
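
The 5-number summary and an outlier rule can be sketched as follows. The text above does not spell out which rule is meant; the code assumes the common convention of flagging points more than 1.5 × IQR beyond the quartiles. The data are made up, and quartile values depend on the interpolation method chosen (here the standard library's "inclusive" method).

```python
import statistics

# Hypothetical measurements (made-up data for illustration).
data = [2, 4, 4, 5, 6, 7, 7, 8, 9, 30]

# Quartiles; other interpolation methods can give slightly different values.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, median, q3, max(data))

# Assumed outlier rule: beyond 1.5 * IQR from the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("5-number summary:", five_number)
print("fences:", (lower_fence, upper_fence))
print("outliers:", outliers)
```

Here the value 30 lies above the upper fence of 13.0, so it would be drawn as a separate point on a boxplot rather than as part of the whisker.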

Multivariate Data

Categorical Data

Identifying associations in multivariate categorical data is important. Multivariate categorical data consist of multiple variables, each of which is categorical. You could record each observation as a row in a spreadsheet, with columns corresponding to variables. There are many other ways to display the data in more manageable formats:

Univariate categorical data table (one variable displayed)
Two-way table or contingency table (two variables displayed)
Univariate bar chart to display marginal distribution
Two Univariate bar charts to display conditional distributions
Side-by-side bar chart
Stacked bar chart
Mosaic Plot
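
To illustrate the two-way table together with marginal and conditional distributions, here is a minimal sketch. The records, and the variable names age group and device, are hypothetical.

```python
from collections import Counter

# Hypothetical records, one per individual: (age group, preferred device).
records = [
    ("young", "phone"), ("young", "phone"), ("young", "laptop"),
    ("old", "phone"), ("old", "laptop"), ("old", "laptop"),
    ("young", "phone"), ("old", "laptop"),
]

# Two-way (contingency) table: count for each combination of categories.
two_way = Counter(records)

# Marginal distribution of age group: counts ignoring the device.
row_totals = Counter(group for group, _ in records)

# Conditional distribution of device within each age group.
conditional = {
    (group, device): count / row_totals[group]
    for (group, device), count in two_way.items()
}

for (group, device), share in sorted(conditional.items()):
    print(f"P({device} | {group}) = {share:.2f}")
```

If the conditional distributions differ between groups, as they do in this made-up example, that suggests an association between the two variables. A side-by-side or stacked bar chart displays the same comparison graphically.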

Quantitative Data

Multivariate means more than one trait is recorded per unit. Quantitative means it takes on a measured numeric value. There are multiple ways to visualize the data:

Univariate Histogram (one variable)
Scatter plot to display association among different variables.
- Linear association
- Quadratic association
- No association
- Direction: positive or negative
- Strength: weak, moderate, strong

There is a way to quantify both the strength and the sign of a linear association: the Pearson correlation, a number between -1 and 1 indicating the strength and sign of the linear association between 2 variables. Note: correlation does not imply causation.

Outliers are points that deviate strongly from the pattern of the rest of the data. Sometimes categorical data can also be integrated with multivariate quantitative data.
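
The Pearson correlation can be computed by hand to see what it measures. The sketch below uses made-up data, and the helper name `pearson_r` is hypothetical. It also shows that a perfect quadratic relationship can have a correlation of 0, because the coefficient only captures linear association.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of xs and ys, scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
linear_up = [2 * v + 1 for v in x]      # perfect positive linear association
linear_down = [-v for v in x]           # perfect negative linear association
quadratic = [(v - 3) ** 2 for v in x]   # symmetric quadratic association

print("positive linear:", pearson_r(x, linear_up))
print("negative linear:", pearson_r(x, linear_down))
print("quadratic:", pearson_r(x, quadratic))
```

This is why a scatter plot is worth checking alongside the coefficient: a value near 0 rules out a linear pattern, not every pattern.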

My Certificate

For more on Visualizing Statistical Data, please refer to the wonderful course here https://www.coursera.org/learn/understanding-visualization-data

My #77 course certificate from Coursera

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai
