Data science is only useful when the data are used to answer a specific, concrete question that could be useful for your organization. Data science is the process of:
- formulating a quantitative question that can be answered with data
- collecting and cleaning the data
- analyzing the data
- communicating the answer to the question to a relevant audience.
There is the data science proper: learning from data and discovering what the right prediction model is. And there is the implementation, often lumped into data engineering, which scales that technology so it can be applied to a large customer base.
Trade-offs always come up in data science:
- Fast speed (to train and test)
Four key activities (somewhat overlapping) define the field:
- exploratory data analysis, quantification, summarization, unsupervised clustering, etc.
- estimation, sampling, variability, defining populations, etc.
- machine learning, supervised learning, etc.
- designing experiments, A/B testing, clinical trials, etc.
Machine learning is a set of algorithms that can take a set of inputs (data) and return a prediction.
- trying to uncover unobserved factors: clustering, mixture models, principal components
- using a collection of predictors and some observed outcomes to build an algorithm that predicts the outcome when it is not observed: random forests, boosting, support vector machines
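The difference between the two kinds of learning can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy `spend` and `churned` data are made up, and the two methods (a one-dimensional k-means and a nearest-centroid classifier) are simple stand-ins chosen for brevity, not the methods named above.

```python
from statistics import mean


def kmeans_1d(values, centers, iterations=10):
    """Unsupervised: group unlabeled numbers around k centers (Lloyd's algorithm)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def fit_nearest_centroid(predictors, outcomes):
    """Supervised: learn one centroid per observed outcome label."""
    by_label = {}
    for x, y in zip(predictors, outcomes):
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Hypothetical toy data: monthly spend for six customers.
spend = [10.0, 12.0, 11.0, 50.0, 52.0, 49.0]

# Unsupervised: no outcomes are observed, we only look for structure.
centers = kmeans_1d(spend, centers=[0.0, 60.0])

# Supervised: outcomes were observed for the same customers (also made up).
churned = ["yes", "yes", "yes", "no", "no", "no"]
model = fit_nearest_centroid(spend, churned)
prediction = predict(model, 48.0)  # outcome not observed for this customer

print(centers, prediction)
```

The unsupervised half never sees `churned`; it only discovers that the spend values fall into two groups. The supervised half uses the observed outcomes to predict the outcome for a new customer.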
Machine Learning vs. Traditional Statistics
Compared to traditional statistics, machine learning:
- emphasizes predictions
- evaluates results via prediction performance
- is very concerned with overfitting, but not with model complexity per se
- emphasizes performance over population modeling and generalizability
- obtains generalizability through performance on novel datasets
- is concerned with performance and robustness
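Evaluating by prediction performance on novel data, and the worry about overfitting, can be illustrated with a held-out test set. The sketch below is a minimal example under assumptions: the data are simulated from a made-up linear relationship, and the two models (a least-squares line and a "memorizer" that returns the outcome of the nearest training point) are chosen only to contrast training error with held-out error.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predict the outcome of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared prediction error."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Simulated data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# Hold out 30% of the rows as the "novel" dataset.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, test_idx = indices[:70], indices[70:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

intercept, slope = fit_line(train_x, train_y)
line = lambda x: intercept + slope * x
memorizer = fit_memorizer(train_x, train_y)

results = {
    "line": (mse(line, train_x, train_y), mse(line, test_x, test_y)),
    "memorizer": (mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y)),
}
for name, (train_error, test_error) in results.items():
    print(f"{name}: train MSE={train_error:.3f}, test MSE={test_error:.3f}")
```

The memorizer reproduces its training data perfectly, so its training error is zero, yet it still makes errors on the held-out rows. That gap between training and held-out performance is what overfitting looks like, and it is why performance is judged on data the model has not seen.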
Compared to machine learning, traditional statistical analysis:
- emphasizes not so much predictions as the superpopulation: you have a sample and want to generalize it to some superpopulation
- tends to focus on a-priori hypotheses
- tends to favor simpler models over complex ones; the idea of a model already seems simpler than the idea of an algorithm
- emphasizes parameter interpretability
- emphasizes modeling and sampling assumptions
- is concerned with assumptions and robustness
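By contrast, a traditional analysis typically reports an interpretable parameter together with its uncertainty. The sketch below estimates a population mean from a sample and attaches a confidence interval. The sample values are made up, and the interval uses a normal approximation, which is an assumption made for brevity; with a sample this small a t-based interval would usually be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean.

    Assumes the sample was drawn at random from the population of interest.
    """
    n = len(sample)
    estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, estimate - z * standard_error, estimate + z * standard_error


# Hypothetical sample of ten measurements.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2, 4.8, 5.0]
estimate, lower, upper = mean_confidence_interval(sample)
print(f"mean={estimate:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The output is a single parameter with a direct interpretation, and its validity rests on the sampling and modeling assumptions rather than on held-out prediction error.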
Both approaches are valuable and have their place, but their goals are very different, and the amount of model or algorithm complexity that is tolerable differs dramatically between them. A fair amount of work goes into making machine learning more interpretable, and a fair amount into giving traditional statistics better predictive performance.
For data scientists, software is the generalization of a specific aspect of a data analysis. Software allows a procedure to be systematized and standardized. Because software has an interface, or a set of inputs and outputs, people don’t have to worry about the gory details of what’s going on inside.
Structure of Projects
A typical data science project will be structured in a few different phases:
- ask the question, and specify what you are interested in
- exploratory data analysis
- are the data suitable?
- sketch the solution
- formal modeling
Output of Experiments
Some of the most common forms of output from a data science experiment include:
omit the unnecessary
- web pages / apps: easy to use, with help pages or documentation and well-commented code
The most positive results are:
- new knowledge is created
- decisions are made based on the data
- data product has impact
Another form of success is to “determine that the data just can not answer the question that you’d like to answer”.
What specifically makes a data science project unsuccessful?
- all the evidence is clear, but the decisions are made in the opposite direction
- results are equivocal
- uncertainty prevents new knowledge
Data Science Toolbox
The data science toolbox is the set of tools used to store, process, analyze, and communicate the results of data science experiments. Some examples are:
- R Markdown, IPython notebook
Hype vs Value
If you want to decide whether a project is hype or a real contribution that can move your organization forward, ask yourself these questions:
- What is the question you are trying to answer with data?
- Do you really have the data to answer that question?
- If you could answer the question, could you use the answer?
For more on Data Science for Executives, please refer to the wonderful course here https://www.coursera.org/learn/data-science-course
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai