Statistics: Arts and Sciences

Statistics is the subject that encompasses all aspects of learning from data: the tools and methods that let us work with data and understand it. Statisticians develop and apply data analysis methods and seek to understand the properties of those methods. Researchers and practitioners apply and extend statistical methodology. Statistics is an evolving and dynamic field.

There are different schools of thought about the field of statistics:

art of summarizing data: make a data set comprehensible to a human observer; what counts as a good summary depends on the goals of the “data consumer”
science of uncertainty: quantify how far our reported findings may fall from “the truth”
science of decisions: the ultimate goal of any statistical analysis
science of variation: emphasis on understanding variation in data
art of forecasting: make predictions about the future
science of measurement: measure difficult-to-define concepts and assess quality
basis for principled data collection: a rational way to manage trade-offs, since data are expensive to collect and resources are limited


Data

Data can be numbers, images, words, or audio. In general, there are two key types of data:

organic / process data: generated organically as the result of some process over time, e.g. financial or point-of-sale transactions, stock market exchanges, etc.
“designed” data collection: designed to specifically address a stated research objective, e.g. individuals sampled from a population.

Big data usually refers to datasets that come from organic processes. Processing these data requires significant computational resources, and data scientists mine them to study trends and uncover interesting relationships.

“Designed” data usually come from sampling a population and administering carefully designed questions. These datasets are generally much smaller than organic or process datasets. They are collected for very specific reasons, rather than being simple reflections of an ongoing natural process.

In the i.i.d. (independent and identically distributed) case, each observation on a variable of interest is completely independent of all other observations, so there is no correlation between the different measurements:

independent: Each observation is independent of all the other observations that might ultimately make up a data set.
identically distributed: All of the values that we actually observe arise from some common statistical distribution.

Given i.i.d. data, we can estimate features of that distribution (mean, variance, extreme percentiles) with a certain amount of precision, and make inferences about those features.
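As a rough illustration of estimating those features, here is a minimal Python sketch using only the standard library. The data are simulated (1,000 independent draws from a normal distribution with mean 50 and standard deviation 10, chosen purely for illustration), and the function name `summarize_iid_sample` is hypothetical.

```python
import random
import statistics

def summarize_iid_sample(sample):
    """Estimate features of the underlying distribution from an i.i.d. sample."""
    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)  # sample variance (n - 1 denominator)
    # n=100 returns 99 cut points; the last one estimates the 99th percentile
    p99 = statistics.quantiles(sample, n=100)[-1]
    # Standard error of the mean: a measure of how precisely the mean is estimated
    std_error = (variance / len(sample)) ** 0.5
    return {"mean": mean, "variance": variance, "p99": p99, "std_error": std_error}

rng = random.Random(0)  # fixed seed, only so the run is reproducible
# Assumed data: independent draws that all share one Normal(50, 10) distribution
sample = [rng.gauss(50, 10) for _ in range(1000)]
summary = summarize_iid_sample(sample)
print(summary)
```

The standard error shrinks as the sample grows, which is one way to read "a certain amount of precision". It relies on the independence assumption, so it would not be trustworthy for dependent data.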

When data are not i.i.d., we need to account for the dependencies and the differences in distribution in our analysis, which may call for different analytic procedures.

Variable Types

A quantitative variable is a numerical, measurable quantity on which arithmetic operations often make sense. A continuous quantitative variable can take on any value within an interval, so it has many possible values. A discrete quantitative variable takes values from a countable set, such as counts.

A categorical (or qualitative) variable classifies individuals or items into different groups. An ordinal categorical variable has some order or ranking associated with its groups. A nominal variable has no associated ranking.

Study Design

There is a spectrum of study design options, from exploratory data analysis all the way up to highly planned efforts aimed at a specific question. Types of research studies include:

  1. Exploratory vs. Confirmatory
  2. Comparative vs. Non-comparative
  3. Observational studies vs. Experiments

Confirmatory research follows the scientific method: specify a falsifiable hypothesis, then test it by collecting data to address pre-specified questions. In comparison, exploratory research collects and analyzes data without first pre-specifying questions.

The goal of comparative research is to contrast one quantity with another. Non-comparative studies focus on estimating or predicting absolute quantities, without an explicit comparison.

Observational data arise naturally. In observational studies, subjects are exposed to a condition rather than being assigned to it: exposure is passive or self-selected, typically because assignment would be impractical or unethical. Experiments involve manipulating the assignment of subjects to treatment arms.

Power analysis is the process of assessing whether a study design is likely to yield meaningful findings. Bias refers to measurements that are systematically off-target, or to samples that are not representative of the population of interest.
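To make power analysis a little more concrete, here is a minimal sketch for one very simple design: a one-sided, one-sample z-test with a known standard deviation. That choice of test, the effect size of 0.5, and the function name `z_test_power` are all assumptions made for illustration; real studies usually need a calculation matched to their actual design.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Approximate power of a one-sided, one-sample z-test with known sigma."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # rejection threshold on the z scale
    shift = effect * sqrt(n) / sigma          # how far a true effect moves the test statistic
    return 1 - std_normal.cdf(critical - shift)

# Assumed scenario: true effect of 0.5 units, standard deviation of 1
for n in (10, 30, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

Under these assumptions, power rises with the sample size, which is why such a calculation is typically done before any data are collected.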



Univariate Data

Categorical (qualitative) data

Categorical data is usually best summarized with:

  1. a frequency table with counts, percentages, or both;
  2. a bar chart, which is a great method of visualization;
  3. a pie chart, but only with caution.
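A frequency table is easy to build by hand. The sketch below uses a made-up list of blood types (hypothetical data) and a hypothetical helper `frequency_table`, and prints counts, percentages, and a crude text bar for each category.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count, percent), most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, 100 * count / total)
        for category, count in counts.most_common()
    }

# Hypothetical categorical data
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]
table = frequency_table(blood_types)

for category, (count, percent) in table.items():
    bar = "#" * count  # a very rough text version of a bar chart
    print(f"{category:<3}{count:>3}{percent:>7.1f}%  {bar}")
```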

Quantitative data

A histogram (for quantitative data) is different from a bar chart (for categorical data). Histograms are a great first analysis of our data: we get a quick view of what the data looks like and what we might want to go on to analyze. There are 4 main aspects of a histogram:

Shape: Overall appearance of the histogram, which can be symmetric, bell-shaped, left skewed (mean < median), right skewed (mean > median), etc.
Center: Mean or median.
Spread: How far our data reaches (spreads): range, interquartile range (IQR), standard deviation, variance.
    • Range = max – min
    • Interquartile range = 3rd quartile – 1st quartile
    • Standard deviation: approximately the average distance that our data points fall from the mean.
Outliers: Data points that fall far from the bulk of the data.
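The numeric parts of this checklist can be computed directly. The sketch below uses a small made-up income list (hypothetical data, in thousands) and a hypothetical helper `describe`. Quartiles can be computed by several conventions; here we use the default method of Python's `statistics.quantiles`, so other tools may report slightly different values.

```python
import statistics

def describe(values):
    """Center and spread measures for one quantitative variable."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default "exclusive" method
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical incomes in thousands; one very large value pulls the mean up
incomes = [22, 25, 27, 30, 31, 33, 35, 38, 40, 120]
summary = describe(incomes)
print(summary)

# A mean well above the median is a hint of right skew
print("right skew hint:", summary["mean"] > summary["median"])
```

Note how the single large value inflates the range and the standard deviation, while the median and IQR barely react to it.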

Recall that the standard deviation is roughly the average distance of our values from the mean. For a normal distribution, about 68% of values are expected to fall within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. This is called the “68-95-99.7 rule” or “empirical rule”. A good way to estimate how unusual a value is, is to calculate its “standard score” or “z-score”.

standard score = ( value - mean ) / standard deviation
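Both the standard score and the empirical rule are easy to check numerically. In this sketch the exam-score numbers (mean 70, standard deviation 8) are invented for illustration, and `z_score` and `share_within` are hypothetical helper names.

```python
from statistics import NormalDist

def z_score(value, mean, stdev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / stdev

def share_within(k):
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Hypothetical exam scores with mean 70 and standard deviation 8
print(z_score(86, 70, 8))  # 2 standard deviations above the mean

# Shares that the empirical rule rounds to 68%, 95% and 99.7%
for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```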

Boxplots give us a nice graphical picture of the 5-number summary: min, 1st quartile, median, 3rd quartile, max. A boxplot also comes with a rule for identifying outliers. One drawback is that it hides features like gaps and clusters in the distribution. But boxplots displayed side-by-side are really useful for making comparisons.
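Here is a small sketch of the numbers behind a boxplot. The outlier rule shown is the common 1.5 × IQR convention, which is an assumption on our part since the notes above do not spell the rule out. The waiting-time data and both function names are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, 1st quartile, median, 3rd quartile, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

def flag_outliers(values, k=1.5):
    """Flag points more than k * IQR below the 1st or above the 3rd quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical waiting times in minutes
waits = [4, 5, 5, 6, 7, 7, 8, 9, 10, 30]
five = five_number_summary(waits)
print(five)
print(flag_outliers(waits))
```

Plotting libraries may compute quartiles slightly differently, so the box edges in a drawn boxplot can differ a little from these values.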

Multivariate Data

Categorical Data

Identifying associations in multivariate categorical data is important. Multivariate categorical data has multiple variables, each of which is categorical. You could record each observation as a row in a spreadsheet, with columns corresponding to variables. There are many other ways to display the data in a more manageable format:

  1. Univariate categorical data table (one variable displayed)
  2. Two-way table or contingency table (two variables displayed)
  3. Univariate bar chart to display marginal distribution
  4. Two Univariate bar charts to display conditional distributions
  5. Side-by-side bar chart
  6. Stacked bar chart
  7. Mosaic Plot
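The two-way table, marginal distribution, and conditional distributions from the list above can be sketched in a few lines. The survey records are made up, and the helper names `marginal` and `conditional` are hypothetical.

```python
from collections import Counter

# Hypothetical survey records: (group, answer)
records = [
    ("student", "yes"), ("student", "yes"), ("student", "no"), ("student", "yes"),
    ("staff", "yes"), ("staff", "no"), ("staff", "no"), ("staff", "no"),
]

# Two-way (contingency) table: one count per (group, answer) cell
two_way = Counter(records)

def marginal(pairs, index):
    """Distribution of one variable, ignoring the other (index 0 or 1)."""
    counts = Counter(pair[index] for pair in pairs)
    return {level: count / len(pairs) for level, count in counts.items()}

def conditional(pairs, given):
    """Distribution of the second variable among rows where the first equals `given`."""
    subset = [second for first, second in pairs if first == given]
    counts = Counter(subset)
    return {level: count / len(subset) for level, count in counts.items()}

print(two_way)
print(marginal(records, 1))
print(conditional(records, "student"))
print(conditional(records, "staff"))
```

When the conditional distributions differ a lot between groups, as they do in this made-up data, that suggests an association between the two variables.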

Quantitative Data

Multivariate means more than one trait is recorded per unit. Quantitative means it takes on a measured numeric value. There are multiple ways to visualize the data:

  1. Univariate Histogram (one variable)
  2. Scatter plot to display association among different variables.
    • Linear association
    • Quadratic association
    • No association
    • Direction: positive or negative
    • Strength: weak, moderate, strong

There is a way to quantify both the strength and sign of a linear association: the Pearson correlation, a number between -1 and 1 indicating the strength and sign of the linear association between 2 variables. Note: Correlation does not imply causation.
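As a minimal sketch, the correlation coefficient can be computed from its textbook definition. The function name `pearson_r` and the tiny data sets are hypothetical. The last example shows why it only measures linear association: a perfect quadratic pattern can still give a correlation of 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, and the two sums of squares
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # strong positive linear association
print(pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # strong negative linear association
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # quadratic association, no linear trend
```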

Outliers are points that deviate strongly from the pattern of the rest of the data. Sometimes categorical data can also be integrated with multivariate quantitative data.



My Certificate

For more on Visualizing Statistical Data, please refer to the wonderful course here https://www.coursera.org/learn/understanding-visualization-data



I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai
