Data Science Workflow

An important first step is to ask good questions about the data. Good questions help turn data into actionable information. The first question can be “What happened?” The second might be “Why did this happen?” At this point, you can go even further and ask “What will happen next?” Finally, you may ask “What should be done about it?”

Analyzing data by hand does not scale very well. You don’t need to become a seasoned programmer to become a data scientist, either. MATLAB is both an environment for interacting with data and a programming language. You will use live scripts to analyze data.

A typical data science project comprises three stages: Data Analysis, Machine Learning, and Results. The goal of Data Analysis is to learn more about your data before trying to learn from it. Machine Learning is the process of using algorithms to model the relationship between variables and observations. Finally, you start working with the Results.



Importing Data

Once you have identified the data source, the next step is to access the data and import it into MATLAB. Preparing data for analysis can be a major challenge in data science. In MATLAB, data is organized into rows and columns. By default, each variable is assumed to be a column vector, with each observation in its own row. When necessary, a third dimension (sheets) is used. Commonly used data types include double, string, categorical, and datetime. A table is itself a data type, called ‘table’. You can have MATLAB generate the import code automatically, which saves time and makes the import easy to share with others.
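To make the row and column layout concrete, here is a minimal sketch that builds a small table by hand. The variable names and values are made up for illustration and do not come from the course data.

```matlab
% Hypothetical data: three observations (rows) of four variables (columns).
temperature = [21.5; 19.0; 23.2];                     % double
station     = ["North"; "South"; "North"];            % string
quality     = categorical(["good"; "poor"; "good"]);  % categorical
timestamp   = datetime(2024, 1, [1; 2; 3]);           % datetime

% Each column vector becomes one table variable.
T = table(temperature, station, quality, timestamp);

tableClass = class(T);              % 'table'
columnClass = class(T.temperature); % 'double'
dims = size(T);                     % 3 observations by 4 variables
```

Dot notation such as T.temperature pulls a single variable out of the table as an ordinary column vector.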

Visualizing Data

MATLAB offers tools to visualize, select, and modify data. You can capture the code generated by MATLAB and add it to a Live Script. Creating visualizations is a great way to gain insight into what the data contains. Capturing the generated code is important because you will typically try a variety of approaches and need to retrace your steps.

Computations

In MATLAB, you can create a vector by entering a sequence of values in square brackets, separated by commas. You can create a uniformly spaced vector with the colon operator. Element-wise operators include .*, ./, and .^. When you add or subtract a scalar, the scalar is automatically expanded to match the size of the vector before the operation is performed.
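A short sketch of these operations, using made-up values:

```matlab
v = [1, 2, 3, 4];   % row vector entered in square brackets
u = 2:2:8;          % colon operator: start:step:end gives [2 4 6 8]

p = v .* u;         % element-wise product:  [2 8 18 32]
q = u ./ v;         % element-wise division: [2 2 2 2]
s = v .^ 2;         % element-wise power:    [1 4 9 16]

w = v + 10;         % the scalar 10 is expanded: [11 12 13 14]
```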

Descriptive statistics provide a convenient way to summarize data sets that may contain millions of values. You can use the summary function to get a quick overview of each variable. The mode function returns the most frequent value in an array. The 'omitnan' option of the mean and median functions excludes NaN values from the calculation.
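The effect of 'omitnan' is easiest to see on a small vector with one missing value. The numbers below are invented for illustration.

```matlab
x = [2 3 3 5 NaN 7];              % one missing value

mAll  = mean(x);                  % NaN, because a single NaN propagates
mOmit = mean(x, 'omitnan');       % 4, the mean of [2 3 3 5 7]
medOmit = median(x, 'omitnan');   % 3
mostFrequent = mode(x);           % 3, the value that occurs most often
```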

The Pearson correlation coefficient describes the linear relationship between two variables. Use the corr function in MATLAB. The magnitude of the correlation coefficient is NOT related to the slope of a linear relationship; it only reflects how close the data points are to falling on a line. Furthermore, a small correlation coefficient only indicates a weak linear relationship. A strong non-linear relationship may still result in a coefficient close to zero. When you need to select elements of a vector or matrix, index with a logical condition.
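Both points about the correlation coefficient can be checked on synthetic data. This sketch uses corrcoef, which is part of base MATLAB and needs no toolbox. The data is made up, and the last line shows selecting elements with a condition.

```matlab
x = (-3:3)';            % column vector, symmetric around zero

yLine  = 0.1*x + 2;     % perfectly linear with a shallow slope
yCurve = x.^2;          % strong but non-linear relationship

R = corrcoef(x, yLine);
rLine = R(1, 2);        % approximately 1 despite the small slope

R = corrcoef(x, yCurve);
rCurve = R(1, 2);       % approximately 0 despite the clear dependence

% Logical indexing: keep only the elements that satisfy a condition.
selected = x(x > 1);    % [2; 3]
```

corrcoef returns a 2-by-2 matrix for two inputs, and the coefficient between the two variables is the off-diagonal element.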

Categorical data

A categorical variable always has a finite set of discrete categories. To reorder categories, use the reordercats function with a vector of strings containing the category names in the desired order. The removecats function removes categories you no longer need. The functions addcats, renamecats, and mergecats add new categories, rename categories, and merge existing categories into a new category, respectively. Calling groupsummary with a table and a grouping variable returns a table listing each category and the number of elements in it. To calculate other statistics, specify the method and the variables to apply it to.
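A minimal sketch of these functions on a made-up table. The variable names sz and weight are hypothetical.

```matlab
% Hypothetical data: six items with a size category and a weight.
sz     = categorical(["small"; "large"; "medium"; "small"; "large"; "small"]);
weight = [1; 9; 5; 2; 11; 3];
T = table(sz, weight);

% Categories default to alphabetical order; impose a meaningful order.
T.sz = reordercats(T.sz, ["small", "medium", "large"]);
catNames = categories(T.sz);

% Count the elements in each category (variable GroupCount).
counts = groupsummary(T, "sz");

% Another statistic: name the method and the variable to apply it to.
means = groupsummary(T, "sz", "mean", "weight");   % adds mean_weight

% Merge two existing categories into a new one.
merged = mergecats(T.sz, ["medium", "large"], "notSmall");
mergedNames = categories(merged);
```

With this data, counts.GroupCount is [3; 1; 2] and means.mean_weight is [2; 5; 10], listed in the reordered category order.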

When visualizing data, use hold on and hold off to draw multiple plots in the same figure.
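For example, a sketch that overlays two curves on the same axes, with data chosen only for illustration:

```matlab
x = linspace(0, 2*pi, 100);

figure
plot(x, sin(x))        % first plot creates the axes
hold on                % keep the existing plot when adding the next one
plot(x, cos(x))
hold off               % later plot commands replace the contents again
legend("sin(x)", "cos(x)")

ax = gca;
numLines = numel(ax.Children);   % 2 line objects share the axes
```

Without hold on, the second plot call would replace the first curve instead of adding to it.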

Live Scripts

Documentation is important in any data science project. First, it allows you to reuse your work. Second, it helps others understand your work. Third, it helps you present the results. More importantly, presenting to a non-technical audience is an essential skill for any data scientist. Adding interactive controls to a live script makes it easy to identify the variables to modify and makes the analysis quicker and easier. Controls also help with selecting appropriate values.



My Certificate

For more on exploratory data analysis with MATLAB, please refer to the wonderful course here https://www.coursera.org/learn/exploratory-data-analysis-matlab


I am Kesler Zhu, thank you for visiting. Check out all of my course reviews at https://KZHU.ai
