Statistics: Arts and Sciences
Statistics is the subject that encompasses all aspects of learning from data: the tools and methods that let us work with data in order to understand it. Statisticians develop and apply data analysis methods and study their properties, while researchers and practitioners in many fields apply and extend statistical methodology. Statistics is a dynamic and evolving field.
There are different schools of thought about the field of statistics:
|art of summarizing data
|make a data set comprehensible to a human observer
depends on the goals of the “data consumer”
|science of uncertainty
|quantify how far our reported findings may fall from “the truth”
|science of decisions
|ultimate goal of any statistical analysis
|science of variation
|emphasis on understanding variation in data
|art of forecasting
|make predictions about future
|science of measurement
|measure difficult-to-define concepts and assess quality
|basis for principled data collection
|a rational way to manage the trade-offs that arise because data are expensive to collect and resources are limited
Data can be numbers, images, words, audio. In general, there are 2 key types of data:
|organic / process data
|generated organically as the result of some process over time.
e.g., financial or point-of-sale transactions, stock market exchanges, etc.
|“designed” data collection
|designed to specifically address a stated research objective.
e.g., individuals sampled from a population.
Big data typically refers to data sets that come from organic processes. Processing them requires significant computational resources, and data scientists mine these data to study trends and uncover interesting relationships.
“Designed” data usually come from sampling populations and administering carefully designed questions. Such data sets are generally much smaller than organic or process data sets, and they are collected for very specific reasons rather than being simple reflections of an ongoing natural process.
In the i.i.d. (independent and identically distributed) case, each observation on a variable of interest is completely independent of all other observations, so there is no correlation between the different measurements:
|All those observations are independent of all the other observations that might ultimately make up a data set.
|All of the values that we actually observe arise from some common statistical distribution.
Given i.i.d. data, we can estimate features of that distribution (mean, variance, extreme percentiles) with a certain amount of precision, and make inferences about those features.
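As a minimal sketch of estimating such features, the snippet below uses Python's standard `statistics` module on a made-up sample that is simply assumed to be i.i.d. The data values and variable names are illustrative and not from the course; the standard error of the mean is shown as one common way to express precision.

```python
import math
import statistics

# Hypothetical sample, ASSUMED to be i.i.d. draws from one distribution.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.7, 5.6, 5.0, 4.9, 5.5]

n = len(sample)
mean = statistics.mean(sample)
variance = statistics.variance(sample)  # sample variance (n - 1 denominator)

# Standard error of the mean: how precisely the mean is estimated.
# This formula relies on the i.i.d. assumption.
std_error = math.sqrt(variance / n)

# An upper percentile: the last of the 9 cut points that split the
# data into 10 equal-probability groups (the 90th percentile).
p90 = statistics.quantiles(sample, n=10)[-1]

print(f"mean={mean:.2f} variance={variance:.3f} "
      f"std_error={std_error:.3f} p90={p90:.2f}")
```

With larger i.i.d. samples the standard error shrinks, which is what "a certain amount of precision" refers to. If the observations were dependent, this standard error would generally be misleading.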
When data are not i.i.d., we need to account for the dependencies and differences among observations in our analysis, which may require different analytic procedures.
A quantitative variable is a numerical, measurable quantity for which arithmetic operations often make sense. A continuous quantitative variable can take on any value within an interval, so it has infinitely many possible values. A discrete quantitative variable takes values from a countable set, such as whole-number counts.
A categorical (or qualitative) variable classifies individuals or items into different groups. An ordinal categorical variable has some sort of order or ranking associated with it. A nominal variable has no ranking associated with it.
There is a spectrum of study design options, from exploratory data analysis all the way up to highly planned efforts aimed at a specific question. Types of research studies include:
- Exploratory vs. Confirmatory
- Comparative vs. Non-comparative
- Observational studies vs. Experiments
Confirmatory research follows the scientific method: specify a falsifiable hypothesis, then test it by collecting data to address pre-specified questions. In contrast, exploratory research collects and analyzes data without first pre-specifying questions.
The goal of comparative research is to contrast one quantity with another. Non-comparative studies focus on estimating or predicting absolute quantities, without an explicit comparison.
Observational data arise naturally: in observational studies, subjects are exposed to a condition rather than being assigned to it (exposure is passive or self-selected, often because assignment would be impractical or unethical). Experiments involve manipulating the assignment of subjects to treatment arms.
Power analysis is the process of assessing whether a study design is likely to yield meaningful findings. Bias refers to measurements that are systematically off-target, or to samples that are not representative of the population of interest.
Categorical (qualitative) data
The best way of summarizing categorical data is usually just:
- a frequency table with either counts or percentages or both.
- bar charts are a great method of visualization.
- choose pie charts with caution.
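A frequency table with both counts and percentages can be sketched in a few lines. The responses below are invented for illustration, and `freq_table` is just a name chosen here.

```python
from collections import Counter

# Hypothetical categorical observations (illustrative only).
responses = ["yes", "no", "yes", "unsure", "yes", "no", "yes", "no"]

counts = Counter(responses)
total = sum(counts.values())

# Frequency table: category -> (count, percentage), most frequent first.
freq_table = {cat: (k, 100 * k / total) for cat, k in counts.most_common()}

for cat, (k, pct) in freq_table.items():
    print(f"{cat:<8}{k:>3}{pct:>7.1f}%")
```

The same counts are what a bar chart would display, one bar per category.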
A histogram (for quantitative data) is different from a bar chart (for categorical data). Histograms are a great first analysis of our data: they give a quick view of what the data look like and what we might want to analyze further. There are 4 main aspects of a histogram:
|Shape: overall appearance of the histogram, which can be symmetric, bell-shaped, left skewed (mean < median), right skewed (mean > median), etc.
|Center: mean or median.
|Spread: how far our data reaches, measured by range, interquartile range (IQR), standard deviation, or variance.
Range = max – min
Interquartile range = 3rd quartile – 1st quartile
Standard deviation: approximately the average distance that our data points fall from the mean.
|Outliers: data points that fall far from the bulk of the data.
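The numeric summaries above can be computed with the standard library. This is a sketch on made-up, right-skewed data. Quartile conventions differ between tools, so the IQR from `statistics.quantiles` (default "exclusive" method) may not match other software exactly.

```python
import statistics

# Hypothetical right-skewed data: one large value pulls the mean upward.
data = [2, 3, 3, 4, 4, 5, 5, 6, 7, 21]

# Center
mean = statistics.mean(data)
median = statistics.median(data)

# Spread
data_range = max(data) - min(data)           # Range = max - min
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                # IQR = 3rd quartile - 1st quartile
std_dev = statistics.stdev(data)             # sample standard deviation

# For right-skewed data the mean exceeds the median.
print(f"mean={mean} median={median} range={data_range} "
      f"IQR={iqr} sd={std_dev:.2f}")
```

Note how the single large value inflates the range and standard deviation far more than the IQR, which is why the IQR is often preferred for skewed data.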
Recall that the standard deviation is roughly the average distance of our values from the mean. For a normal distribution, about 68% of values are expected to fall within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. This is called the “68-95-99.7 rule” or “empirical rule”. A good way to estimate how unusual a value is, is to calculate its “standard score” or “z-score”.
standard score = ( value - mean ) / standard deviation
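The formula translates directly into code. The sketch below also checks the empirical rule against the standard normal distribution using `statistics.NormalDist`. The exam scores and the helper name `z_score` are hypothetical.

```python
import statistics
from statistics import NormalDist


def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std_dev


# Hypothetical exam scores (illustrative only).
scores = [62, 70, 74, 75, 78, 80, 81, 85, 90, 95]
mu = statistics.mean(scores)
sigma = statistics.stdev(scores)
z = z_score(95, mu, sigma)
print(f"z-score of 95: {z:.2f}")

# Empirical rule: share of a normal distribution within k standard deviations.
std_normal = NormalDist()  # mean 0, standard deviation 1
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} sd: {share:.1%}")
```

A z-score beyond roughly 2 or 3 in absolute value marks a value as unusual, provided the data are approximately normal.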
Boxplots give a graphical picture of the 5-number summary: min, 1st quartile, median, 3rd quartile, max. They come with a rule for identifying outliers. One drawback is that they hide features such as gaps and clusters in the distribution, but boxplots displayed side by side are really useful for making comparisons.
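A rough sketch of the 5-number summary follows, together with one outlier rule. The text does not say which rule it has in mind, so the widely used 1.5 × IQR convention is assumed here. The function names and data are illustrative.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)


def iqr_outliers(data, k=1.5):
    """Values beyond k * IQR from the quartiles (ASSUMED 1.5 * IQR rule)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical data with one value far from the bulk.
data = [2, 3, 3, 4, 4, 5, 5, 6, 7, 21]
summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary, outliers)
```

Plotting libraries may compute quartiles slightly differently, so the whiskers they draw can differ a little from these numbers.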
Identifying associations in multivariate categorical data is important. Multivariate categorical data have multiple variables, each of which is categorical. You could record each observation as a row in a spreadsheet, with columns corresponding to variables, but there are many other ways to display the data in a more manageable format:
- Univariate categorical data table (one variable displayed)
- Two-way table or contingency table (two variables displayed)
- Univariate bar chart to display marginal distribution
- Two Univariate bar charts to display conditional distributions
- Side-by-side bar chart
- Stacked bar chart
- Mosaic Plot
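To make the table-based displays concrete, here is a small sketch that builds a two-way table from row-per-observation records and derives a marginal and a conditional distribution. The records, group labels, and variable names are all invented for illustration.

```python
from collections import Counter

# Hypothetical records: one (group, answer) pair per observed unit.
records = [
    ("A", "yes"), ("A", "yes"), ("A", "no"), ("A", "yes"),
    ("B", "yes"), ("B", "no"), ("B", "no"), ("B", "no"),
]

groups = sorted({g for g, _ in records})
answers = sorted({a for _, a in records})

# Two-way (contingency) table: count for each (group, answer) cell.
table = Counter(records)
total = len(records)

# Marginal distribution of the answer, ignoring group.
answer_counts = Counter(a for _, a in records)
marginal = {a: answer_counts[a] / total for a in answers}

# Conditional distribution of the answer within each group.
group_totals = Counter(g for g, _ in records)
conditional = {
    g: {a: table[(g, a)] / group_totals[g] for a in answers} for g in groups
}

print("marginal:", marginal)
print("conditional:", conditional)
```

When the conditional distributions differ noticeably between groups, as they do in this toy data, that suggests an association between the two variables. Side-by-side or stacked bar charts visualize exactly these conditional proportions.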
Multivariate means that more than one trait is recorded per unit, and quantitative means that each trait takes on a measured numeric value. There are multiple ways to visualize such data:
- Univariate Histogram (one variable)
- Scatter plot to display association among different variables, described by:
  - Type: linear association, quadratic association, or no association
  - Direction: positive or negative
  - Strength: weak, moderate, strong
The strength and sign of a linear association can be quantified with the Pearson correlation, a number between -1 and 1 indicating the strength and sign of the linear association between 2 variables. Note: correlation does not imply causation.
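A minimal hand-written sketch of the Pearson correlation is given below. The function name `pearson_r` and the example data are illustrative. The last example shows that a perfect quadratic relationship can still have a correlation of 0, because the coefficient only measures linear association.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])             # perfect positive
negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])             # perfect negative
quadratic = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x**2
print(positive, negative, quadratic)
```

This is why a scatter plot should be inspected alongside the coefficient: a value near 0 rules out a linear pattern, not every pattern.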
Outliers are points that deviate strongly from the pattern of the rest of the data. Sometimes categorical data can also be integrated with multivariate quantitative data.
For more on Visualizing Statistical Data, please refer to the wonderful course here https://www.coursera.org/learn/understanding-visualization-data
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai