Data science is only useful when the data are used to answer a specific, concrete question that matters to your organization. Data science is the process of:

  1. formulating a quantitative question that can be answered with data
  2. collecting and cleaning the data
  3. analyzing the data
  4. communicating the answer to the question to a relevant audience.


There is the data science part: learning from data and discovering what the right prediction model is. And there is the implementation part, lumped into data engineering: scaling that technology so it can be applied to a large customer base.

Data scientists and data engineers sit on a spectrum of overlapping skills:

  • Data scientist: modeling, visualization, story-telling
  • Shared: mathematics, programming, statistics
  • Data engineer: system implementation, DB administration, data storage

Trade-offs always come up in data science:

  • Interpretability
  • Accuracy
  • Simplicity
  • Speed (to train and test)
  • Scalability

Statistics

Four key (somewhat overlapping) activities define the field:

  • descriptive analysis: exploratory data analysis, quantification, summarization, unsupervised clustering, etc.
  • inference: estimation, sampling, variability, defining populations, etc.
  • prediction: machine learning, supervised learning, etc.
  • design: designing experiments, A/B testing, clinical trials, etc.
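Descriptive analysis, for instance, can be as simple as quantifying and summarizing a sample. A minimal sketch using only Python's standard library (the sales figures are made up for illustration):

```python
import statistics

# A hypothetical sample of daily sales figures (made-up data)
sales = [120, 135, 128, 150, 142, 119, 160]

# Descriptive analysis: quantify and summarize the sample
print("mean:", statistics.mean(sales))
print("median:", statistics.median(sales))
print("std dev:", round(statistics.stdev(sales), 2))
```

Even this tiny summary already answers concrete questions (typical value, spread) before any modeling happens.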

Machine Learning

Machine learning is a set of algorithms that can take a set of inputs (data) and return a prediction.

  • Unsupervised learning: trying to uncover unobserved factors, e.g. clustering, mixture models, principal components.
  • Supervised learning: using a collection of predictors and some observed outcomes to build an algorithm that predicts the outcome when it is not observed, e.g. random forests, boosting, support vector machines.
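As a sketch of the supervised case, here is a one-nearest-neighbour classifier: predictors plus observed outcomes are used to predict an unobserved outcome. The function name and the study-hours data are made up for illustration:

```python
# A toy supervised learner: 1-nearest-neighbour classification.

def predict_1nn(train_x, train_y, new_x):
    """Predict the outcome of new_x as the outcome of its nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - new_x))
    return train_y[nearest]

# Observed data: hours of study (predictor) -> pass/fail (outcome)
hours = [1, 2, 3, 8, 9, 10]
passed = ["fail", "fail", "fail", "pass", "pass", "pass"]

print(predict_1nn(hours, passed, 2.5))  # → "fail"
print(predict_1nn(hours, passed, 8.5))  # → "pass"
```

Random forests, boosting, and support vector machines are far more sophisticated, but they follow the same input-output contract.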

Machine Learning vs. Traditional Statistics

Compared to traditional statistics, machine learning:

  1. emphasizes predictions
  2. evaluates results via prediction performance
  3. cares a great deal about over-fitting, but not about model complexity per se
  4. emphasizes performance over population modeling and generalizability
  5. obtains generalizability through performance on novel datasets
  6. focuses on performance and robustness
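The idea of obtaining generalizability through performance on novel data can be sketched with a simple hold-out split. The data and the deliberately trivial model (predict the training mean) are made up for illustration:

```python
# Evaluate a model by its prediction performance on held-out, "novel" data.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8), (6, 12.2)]

train, test = data[:4], data[4:]          # hold out the last points as novel data
mean_y = sum(y for _, y in train) / len(train)

# Prediction performance: mean squared error on the held-out points
mse = sum((y - mean_y) ** 2 for _, y in test) / len(test)
print("held-out MSE:", round(mse, 2))
```

A model that only fits the training points well would score poorly here, which is exactly the over-fitting concern in point 3.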

Compared to machine learning, traditional statistical analysis:

  1. emphasizes predictions less than the superpopulation: you have a sample and want to generalize from it to some superpopulation
  2. tends to focus on a-priori hypotheses
  3. tends to favor simpler models over complex ones: the very idea of a model is simpler than the idea of an algorithm
  4. emphasizes parameter interpretability
  5. emphasizes modeling and sampling assumptions
  6. focuses on assumptions and robustness
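The superpopulation idea in point 1 can be sketched with a simple interval estimate: use a sample to estimate a population mean and attach an interpretable measure of uncertainty. The sample is made up, and a normal approximation (z = 1.96) is used rather than a t-interval, for brevity:

```python
import statistics

# A made-up sample, assumed drawn from a larger superpopulation
sample = [4.8, 5.1, 5.4, 4.9, 5.2, 5.0, 5.3, 4.7]

mean = statistics.mean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem      # approximate 95% CI

print(f"estimated population mean: {mean:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The emphasis is on the interpretable parameter (the population mean) and the sampling assumptions behind the interval, not on predicting individual values.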

Both approaches are valuable and have their place. The amount of tolerable model or algorithm complexity differs dramatically between them, and their goals are very different. There is a fair amount of work on making machine learning more interpretable, and also a fair amount of work on giving traditional statistics better predictive performance.



Software Engineering

For data scientists, software is the generalization of a specific aspect of a data analysis. Software allows a procedure to be systematized and standardized. Software exposes an interface, a set of inputs and outputs, so that users won't have to worry about the gory details of what's going on.

Structure of Projects

A typical data science project will be structured in a few different phases:

  1. ask the question, and specify what you are interested in
  2. exploratory data analysis
    • are the data suitable?
    • sketch the solution
  3. formal modeling
  4. interpretation
  5. communication

Output of Experiments

Some of the most common forms of output from a data science experiment include:

  • reports and presentations
    • clearly written, with a narrative
    • concise conclusions, omitting the unnecessary
    • reproducible
  • interactive web pages and apps
    • easy to use
    • with help pages or documentation
    • code well commented and version controlled

Defining Success

The most positive results are:

  1. new knowledge is created
  2. decisions are made based on the data
  3. data product has impact

Another form of success is to “determine that the data just can not answer the question that you’d like to answer”.

What specifically makes a data science project unsuccessful?

  1. all the evidence is clear, but the decisions are made in the opposite direction
  2. results are equivocal
  3. uncertainty prevents new knowledge

Data Science Toolbox

The data science toolbox is the set of tools used to store, process, analyze, and communicate the results of data science experiments. Some examples are:

  • databases: PostgreSQL
  • programming languages: R, Python
  • scaling up: Hadoop, Spark
  • communication: Slack
  • help websites: Stack Overflow
  • reproducible documents: R Markdown, IPython notebooks
  • visualization: Shiny

Hype vs Value

If you want to decide whether a project is hype or a real contribution that can move your organization forward, ask yourself these questions:

  1. What is the question you are trying to answer with data?
  2. Do you really have the data to answer that question?
  3. If you could answer the question, could you use the answer?



For more on Data Science for Executives, please refer to the wonderful course here https://www.coursera.org/learn/data-science-course


I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai
