Data science is only useful when the data are used to answer a specific, concrete question that matters to your organization. Data science is the process of:

  1. formulating a quantitative question that can be answered with data
  2. collecting and cleaning the data
  3. analyzing the data
  4. communicating the answer to the question to a relevant audience.


There is the data science part: learning from data and discovering what the right prediction model is. And there is the implementation part, lumped into data engineering: scaling that technology so it can be applied to a large customer base.

Data scientists and data engineers sit on a spectrum of overlapping skills:

  • Data scientist: modeling, visualization, story-telling
  • Shared: mathematics, programming, statistics
  • Data engineer: system implementation, DB administration, data storage

Trade-offs always come up in data science:

  • Interpretability
  • Accuracy
  • Simplicity
  • Speed (to train and test)
  • Scalability

Statistics

Four key (somewhat overlapping) activities define the field:

  • descriptive analysis: exploratory data analysis, quantification, summarization, unsupervised clustering, etc.
  • inference: estimation, sampling, variability, defining populations, etc.
  • prediction: machine learning, supervised learning, etc.
  • design: designing experiments, A/B testing, clinical trials, etc.
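Descriptive analysis, for instance, can be as simple as quantifying and summarizing a sample. A minimal sketch using only Python's standard library (the sales figures are made up for illustration):

```python
import statistics

# A hypothetical sample of daily sales figures (made-up data)
sales = [120, 135, 128, 150, 142, 119, 160]

# Descriptive analysis: quantify and summarize the sample
print("mean:", statistics.mean(sales))
print("median:", statistics.median(sales))
print("std dev:", round(statistics.stdev(sales), 2))
```

Even this tiny summary already answers concrete questions (typical value, spread) before any modeling happens.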

Machine Learning

Machine learning is a set of algorithms that can take a set of inputs (data) and return a prediction.

  • Unsupervised learning: trying to uncover unobserved factors, e.g. clustering, mixture models, principal components.
  • Supervised learning: using a collection of predictors and some observed outcomes to build an algorithm that predicts the outcome when it is not observed, e.g. random forests, boosting, support vector machines.
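As a sketch of the supervised case, here is a one-nearest-neighbour classifier: predictors plus observed outcomes are used to predict an unobserved outcome. The function name and the study-hours data are made up for illustration:

```python
# A toy supervised learner: 1-nearest-neighbour classification.

def predict_1nn(train_x, train_y, new_x):
    """Predict the outcome of new_x as the outcome of its nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - new_x))
    return train_y[nearest]

# Observed data: hours of study (predictor) -> pass/fail (outcome)
hours = [1, 2, 3, 8, 9, 10]
passed = ["fail", "fail", "fail", "pass", "pass", "pass"]

print(predict_1nn(hours, passed, 2.5))  # → "fail"
print(predict_1nn(hours, passed, 8.5))  # → "pass"
```

Random forests, boosting, and support vector machines are far more sophisticated, but they follow the same input-output contract.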

Machine Learning vs. Traditional Statistics

Compared to traditional statistics, machine learning:

  1. emphasizes predictions
  2. evaluates results via prediction performance
  3. cares a great deal about over-fitting, but not about model complexity per se
  4. emphasizes performance over population modeling and generalizability
  5. obtains generalizability through performance on novel datasets
  6. focuses on performance and robustness
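The idea of obtaining generalizability through performance on novel data can be sketched with a simple hold-out split. The data and the deliberately trivial model (predict the training mean) are made up for illustration:

```python
# Evaluate a model by its prediction performance on held-out, "novel" data.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8), (6, 12.2)]

train, test = data[:4], data[4:]          # hold out the last points as novel data
mean_y = sum(y for _, y in train) / len(train)

# Prediction performance: mean squared error on the held-out points
mse = sum((y - mean_y) ** 2 for _, y in test) / len(test)
print("held-out MSE:", round(mse, 2))
```

A model that only fits the training points well would score poorly here, which is exactly the over-fitting concern in point 3.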

Compared to machine learning, traditional statistical analysis:

  1. emphasizes predictions less than the superpopulation: you have a sample and want to generalize from it to some superpopulation
  2. tends to focus on a-priori hypotheses
  3. tends to favor simpler models over complex ones: the very idea of a model is simpler than the idea of an algorithm
  4. emphasizes parameter interpretability
  5. emphasizes modeling and sampling assumptions
  6. focuses on assumptions and robustness
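The superpopulation idea in point 1 can be sketched with a simple interval estimate: use a sample to estimate a population mean and attach an interpretable measure of uncertainty. The sample is made up, and a normal approximation (z = 1.96) is used rather than a t-interval, for brevity:

```python
import statistics

# A made-up sample, assumed drawn from a larger superpopulation
sample = [4.8, 5.1, 5.4, 4.9, 5.2, 5.0, 5.3, 4.7]

mean = statistics.mean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem      # approximate 95% CI

print(f"estimated population mean: {mean:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The emphasis is on the interpretable parameter (the population mean) and the sampling assumptions behind the interval, not on predicting individual values.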

Both approaches are valuable and have their place. The amount of tolerable model or algorithm complexity differs dramatically between them, and their goals are very different. There is a fair amount of work on making machine learning more interpretable, and also a fair amount of work on giving traditional statistics better predictive performance.



Software Engineering

For data scientists, software is the generalization of a specific aspect of a data analysis. Software allows a procedure to be systematized and standardized. Software exposes an interface, a set of inputs and outputs, so that users won't have to worry about the gory details of what's going on.

Structure of Projects

A typical data science project will be structured in a few different phases:

  1. ask the question, and specify what you are interested in
  2. exploratory data analysis
    • are the data suitable?
    • sketch the solution
  3. formal modeling
  4. interpretation
  5. communication

Output of Experiments

Some of the most common forms of output from a data science experiment include:

  • reports and presentations
    • clearly written, with a narrative
    • concise conclusions, omitting the unnecessary
    • reproducible
  • interactive web pages and apps
    • easy to use
    • with help pages or documentation
    • code well commented and version controlled

Defining Success

The most positive results are:

  1. new knowledge is created
  2. decisions are made based on the data
  3. data product has impact

Another form of success is to “determine that the data just can not answer the question that you’d like to answer”.

What specifically makes a data science project unsuccessful?

  1. all the evidence is clear, but the decisions are made in the opposite direction
  2. results are equivocal
  3. uncertainty prevents new knowledge

Data Science Toolbox

The data science toolbox is the set of tools used to store, process, analyze, and communicate the results of data science experiments. Some examples are:

  • databases: PostgreSQL
  • programming languages: R, Python
  • scaling up: Hadoop, Spark
  • communication: Slack
  • help websites: Stack Overflow
  • reproducible documents: R Markdown, IPython notebooks
  • visualization: Shiny

Hype vs Value

If you want to decide whether a project is hype or a real contribution that can move your organization forward, ask yourself these questions:

  1. What is the question you are trying to answer with data?
  2. Do you really have the data to answer that question?
  3. If you could answer the question, could you use the answer?



For more on Data Science for Executives, please refer to the wonderful course here https://www.coursera.org/learn/data-science-course


I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai
