Data rarely arrives in exactly the form you need, so you need a data pipeline to prepare it. A data pipeline typically has three stages: data extraction, data transformation, and data loading, collectively known as ETL. The tools and processes you choose for each stage, and for the pipeline as a whole, are an important decision that affects every stage of the project.

Data Extraction

The first stage is data extraction. Most real-life use cases for machine learning are ongoing, so you’ll likely want to automate data retrieval, but this automation is rarely simple. In particular, it almost always involves getting permission from the appropriate source. You should also consider data privacy and intellectual property protection, and ensure the data is warehoused according to the relevant policies. This is especially true for cloud computing.

Gathering data in one place might also involve scanning and pre-processing to make sure your data schema is consistent. Note that:

  1. Duplicate observations are particularly common if you’re combining data sets from multiple sources.
  2. Irrelevant observations are ones that don’t actually fit the specific problem you’re trying to solve.

There’s nothing to learn from a static signal, and the data might simply contain far more detail than you need. It’s important to think about your data in the context of your specific problem.
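The two points above can be sketched in a few lines of plain Python. The records, the field names, and the "US customers only" relevance rule are all hypothetical, chosen only to illustrate the idea; a real pipeline would more likely use a data-frame library.

```python
# Hypothetical records merged from two sources; field names are illustrative.
records = [
    {"customer_id": 1, "country": "US", "amount": 20.0},
    {"customer_id": 2, "country": "DE", "amount": 35.5},
    {"customer_id": 1, "country": "US", "amount": 20.0},  # same record from a second source
    {"customer_id": 3, "country": "US", "amount": 12.0},
]

def drop_duplicates(rows):
    """Keep the first occurrence of each identical record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def keep_relevant(rows, predicate):
    """Keep only the observations that fit the problem being solved."""
    return [row for row in rows if predicate(row)]

deduplicated = drop_duplicates(records)
# Assumption for this sketch: the problem only concerns US customers.
relevant = keep_relevant(deduplicated, lambda row: row["country"] == "US")
print(len(records), len(deduplicated), len(relevant))
```

Exact-match deduplication is the easy case; near-duplicates (different spelling, different timestamps) usually need domain-specific matching rules.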

Think about detecting outliers and deciding how to handle them. On the one hand, they may hold key information because they’re so different from the main group. But on the other hand they may also just hold meaningless noise or mistakes that can throw off your model.
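One common way to flag candidate outliers is the interquartile range (IQR) rule. The sketch below is one possible approach, not the only one; the readings and the multiplier `k=1.5` are assumptions made for illustration. It only flags values, leaving the decision to keep or drop them to you.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
flagged = iqr_outliers(readings)
print(flagged)
```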

Building strong data pipelines involves a lot of programming and integration between many members on the team. It’s good to have an idea of what coding standards and testing procedures you’re going to hold to. Testing data-flow or pipelines creates challenges that are different from traditional software testing like unit testing.

For data pipelines, we need to test both:

  1. The code, in the usual way. Make sure the transformations and cleaning steps work on test cases.
  2. The data, to confirm it conforms to the expected schema. It’s best to have all data go through these sanity checks automatically, so that the downstream process can use data with confidence.
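Both kinds of test can be sketched as follows. The schema, the `clean_country` transformation, and the field names are hypothetical; the point is only that code gets ordinary test cases while every record also passes through an automatic schema check.

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"customer_id": int, "country": str, "amount": float}

def clean_country(value):
    """An example cleaning step: normalize country codes."""
    return value.strip().upper()

def schema_errors(row, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in one record (empty if it conforms)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(row[field]).__name__}"
            )
    for field in row:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors

# 1. Test the code in the usual way, with a test case.
assert clean_country(" us ") == "US"

# 2. Test the data against the schema.
good = {"customer_id": 1, "country": "US", "amount": 20.0}
bad = {"customer_id": "2", "country": "DE"}
print(schema_errors(good))
print(schema_errors(bad))
```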

Data Transformation

You now have relevant data about various clients, collected from different sources, in one place; however, it’s still not ready to be consumed by a learning algorithm. You’ll need a good idea of how to transform your data after consolidating it. For example, you might convert images or recordings of words into text strings as part of data preparation, then use that output as the input to your machine learning workflow. At this stage, you need to think about converting the raw data into a consistent data format that your machine learning algorithms and analysis tools can read directly and correctly.

The next major thing to look into is how to integrate different kinds of data to get a unified view. If you’re doing supervised learning, you can think of it as creating a standard matrix of your data: each row is a specific example, each column is a particular feature, and for the learning data each example must be paired with the appropriate label. Sometimes this will take extra care and involvement from a domain expert. Often, you’ll find yourself relying heavily on the metadata associated with the files.
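As a rough sketch of that matrix, assume each source is keyed by a shared example id. The sources, ids, and feature names below are made up. Features from all sources are merged per example, and each row is paired with its label.

```python
# Hypothetical sources, each keyed by example id.
profile = {"a1": {"age": 34}, "a2": {"age": 51}, "a3": {"age": 29}}
usage = {"a1": {"visits": 3}, "a2": {"visits": 7}}
labels = {"a1": 0, "a2": 1, "a3": 0}

FEATURES = ["age", "visits"]

def build_matrix(sources, labels, features):
    """Return (ids, X, y): one row of features per labeled example."""
    ids, X, y = [], [], []
    for example_id in sorted(labels):
        merged = {}
        for source in sources:
            merged.update(source.get(example_id, {}))
        # None marks a feature value that no source provided.
        X.append([merged.get(name) for name in features])
        y.append(labels[example_id])
        ids.append(example_id)
    return ids, X, y

ids, X, y = build_matrix([profile, usage], labels, FEATURES)
for example_id, row, label in zip(ids, X, y):
    print(example_id, row, label)
```

Leaving gaps as `None` rather than silently dropping the row keeps the missing values visible for the quality checks and imputation steps discussed later.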

The next task is to make sure the unified dataset is relevant to the question we are trying to answer. If some of it is not, eliminate the irrelevant data before putting any additional effort into cleaning it. Removing irrelevant data can be highly domain specific.

There are potentially many different transformation steps, and it would be impossible to cover them all. Your ultimate goal is to convert your raw data into a well-defined, structured set. Know what defines an example; then, whatever sources you use, the work always comes back to fitting them into that structure and matching each piece of data to the appropriate example.

Data Quality

Machine learning algorithms learn patterns from data. If you feed them low-quality data, the patterns will be at best low-quality, and at worst, radically misleading. Garbage in, garbage out. With garbage learning data, a learning algorithm can still create what looks like a perfectly valid model (a question-answering machine). But when you go to use it, instead of useful answers, the garbage shows up.

In data quality assurance processes, the first thing we should inspect is the source of the data. This will help us make sure the necessary domain-specific quality standards have been applied during the collection and storage process. Here domain experts play a major role.

Completeness of data is a major factor in determining the quality of our data. In this step, we’re checking that we have a certain percentage of our required data available to address our problem. Partial data (or data where some features of some examples are missing values) can still be used through data imputation techniques. But it’s best to know what’s missing and why.
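A completeness check can be as simple as the share of non-missing values per feature. In this sketch the rows, the feature names, and the 80% threshold are assumptions; the right threshold depends on your problem.

```python
def completeness(rows, fields):
    """Fraction of rows with a non-missing value, per field."""
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) is not None) / total
        for field in fields
    }

# Made-up examples; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": 58000},
    {"age": 45, "income": 48000},
]

REQUIRED_SHARE = 0.8  # assumed threshold for this sketch
report = completeness(rows, ["age", "income"])
too_sparse = [field for field, share in report.items() if share < REQUIRED_SHARE]
print(report)
print(too_sparse)
```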

Duplicate data also affects the quality of the data. Identifying and removing duplicates can be tricky. In the best case, they’re obvious. But sometimes it’s harder to tell whether a duplicate comes from an error in the data process, in which case it will skew our analysis or mislead our models, or is a real artifact of the data. Identifying and removing duplicates has to be done with care, combining purely data-driven validation and summary statistics with domain knowledge.

We want to remove corrupted records, but identifying actual corruption usually requires domain experts, who can point out common sources of corruption or define the bounds of legitimate values. Correct those errors manually if possible; otherwise, remove the records.

The next concern is to make sure that the data is valid, meaning the data we have indeed matches the requirements and structure imposed by the business problem. If the data doesn’t match the problem, there’s no point in improving quality.

When we identify that the data is appropriate for the task at hand, we can move on to check that the data is accurate. The learning data may not come from the same distribution as the operational data. Automated checks that compare the statistical profile of the learning data to the profile of the operational data can catch some things, but nothing’s perfect.
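One very simple automated check compares the mean of a feature in operational data with its profile in the learning data. This is a minimal sketch with made-up numbers; the "2 standard deviations" alert threshold is an assumption, and a check like this catches only coarse shifts.

```python
import statistics

def mean_shift(train_values, operational_values):
    """Shift of the operational mean, measured in training standard deviations."""
    train_mean = statistics.fmean(train_values)
    train_sd = statistics.stdev(train_values)
    return abs(statistics.fmean(operational_values) - train_mean) / train_sd

# Made-up values of one numerical feature.
train = [20, 22, 19, 21, 20, 23, 18, 21]
live_ok = [21, 20, 22, 19, 20]
live_shifted = [30, 32, 29, 31, 33]

ALERT_THRESHOLD = 2.0  # assumed threshold for this sketch
for name, batch in [("live_ok", live_ok), ("live_shifted", live_shifted)]:
    shift = mean_shift(train, batch)
    status = "ALERT" if shift > ALERT_THRESHOLD else "ok"
    print(name, round(shift, 2), status)
```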

Moreover, working with legacy or outdated data could lead our learning algorithm to discover outdated trends and patterns that don’t reflect current or future events. Finally, to assure the quality of our dataset, we should assess how consistent it is, especially whether the units are consistent and whether the ranges of numerical features are the same.

You may want automated quality checks and you’re going to want to have quality assurance processes that take advantage of domain expertise and work within your problem space.

How Much Data is Needed

A good machine learning model often requires training with an extremely large number of samples. The amount of data you need depends both on:

  1. the complexity of your problem – the unknown underlying function that best relates your features to the output variable.
  2. the complexity of your chosen algorithm – the number of knobs the machine learning algorithm is allowed to twiddle in its search for the best model.

Complexity is always a trade-off: greater complexity means that the machine learning solution can be more accurate and precise on seen data, but it also increases the danger of overfitting, where the model fits exactly to irrelevant details and sacrifices generality to unseen data.
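The trade-off can be illustrated with a toy comparison. The data below is made up (roughly y = 2x plus small noise), and the "flexible" model is simply a memorizer that answers with the label of the closest training point; it stands in for an overly complex model and is not a recommendation.

```python
# Made-up training data: y is roughly 2 * x plus small noise.
train = [(0, 0.5), (1, 1.6), (2, 4.6), (3, 5.5), (4, 8.4), (5, 9.4)]
# Unseen points that follow the underlying relationship y = 2 * x.
unseen = [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0), (3.5, 7.0), (4.5, 9.0)]

def fit_line(points):
    """Simple model: least-squares straight line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def memorize(points):
    """Flexible model: return the label of the closest training point."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, points):
    """Mean squared error of a model on a set of (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

simple = fit_line(train)
flexible = memorize(train)
print("seen data:  ", mse(simple, train), mse(flexible, train))
print("unseen data:", mse(simple, unseen), mse(flexible, unseen))
```

The memorizer reproduces every training point exactly, noise included, so its error on seen data is zero, yet on the unseen points it does worse than the straight line.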

With perfect features closely related to the correct answer, you might be able to learn from only a few examples. With more realistic features not closely related to the correct answer, you need more data. Similarly, the more features it takes to accurately capture the relationship, the more data you need.

How much data you need to learn a good model also depends on the algorithm used to learn the underlying mapping function. Deep learning is able to find incredibly complex functions (non-linear relationships), whereas simple linear classifiers cannot.

So if you’re using deep learning or neural networks, you absolutely must have thousands of unique examples for every class you want to learn about. If a linear classifier achieves good performance with hundreds of examples per class, you may need tens or hundreds of thousands of examples to consistently get the same performance with a nonlinear classifier.

In pretty much every case, you want hundreds to thousands of examples to learn from, but you can use good feature engineering to somewhat compensate for your small datasets.

Types of Data

To a machine, data is numbers. At some point, all the data you want to use will have to be converted into something useful for a computer. In machine learning, three important kinds of features are:

  1. Categorical / nominal: data that can be placed into unordered bins. You should not number them, because numbering implies an artificial order.
  2. Ordinal: data that can be placed into ordered bins, when there is a meaningful way to order the categories.
  3. Continuous: data that can be measured. Ordered, with infinite possible values within a given range.

Deciding which of these kinds of data most naturally fits your examples is a big part of setting up a machine learning process. When you need to represent category names using vectors, consider one-hot encoding, such as (1, 0, 0), (0, 1, 0), (0, 0, 1), which removes the false ordering imposed by labels like 1, 2 and 3. In cases where an example belongs to more than one category, we can use a technique called multi-hot encoding.
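Both encodings are easy to sketch by hand. The category list below is a made-up example; libraries provide equivalent encoders, but the logic is just this:

```python
# Hypothetical category vocabulary, in a fixed order.
CATEGORIES = ["red", "green", "blue"]

def one_hot(value, categories=CATEGORIES):
    """Exactly one position is 1: the one matching the single category."""
    return [1 if value == category else 0 for category in categories]

def multi_hot(values, categories=CATEGORIES):
    """One position is 1 for every category the example belongs to."""
    present = set(values)
    return [1 if category in present else 0 for category in categories]

print(one_hot("green"))
print(multi_hot(["red", "blue"]))
```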

Some machine learning algorithms are not affected by categorical variable transformation. Tree-based methods are not affected by one-hot encoding because they’re able to make splitting decisions with the categorical value directly. Naive Bayes methods are also not affected because they depend on the count of the values in a class. But for most learning algorithms and even some implementations of tree-based methods, you have to do the encoding.

Types of Missing Data

Data you encounter in the real world often has missing values and data can be missing for various reasons:

  1. Missing completely at random (MCAR). There’s no connection whatsoever between the observed data and the missing data. Then the only thing that matters is that some percentage of your data is missing. You can throw out those incomplete examples and it won’t change your resulting model. Unfortunately, it’s uncommon for there to be no relation at all between missing values and observed values.
  2. Missing at random (MAR). There’s still some randomness in what values are missing, but it relates to the data you actually do have. There’s some systematic difference and the missing values are influenced by something else present in the data: something you do have correlates with the values that are missing. The missing value has some conditional dependence on one of the other features. So if you simply throw away the incomplete data, you’re changing the distribution of your training data, maybe in some important way.
  3. Missing not at random (MNAR). There’s a consistent reason why that particular data is missing and it’s directly related to the value itself. These missing values are the most difficult to fill in because they hold important information that you can’t necessarily recover from the data you have. Algorithms that use such data need to accommodate missing values as part of the process.
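The MAR warning above can be made concrete with a small, made-up dataset in which the `spend` value is missing mostly for one customer segment. The segment names and values are hypothetical. Dropping the incomplete rows changes how much of the data each segment represents.

```python
# Made-up data: "spend" is missing mostly for the "new" segment (MAR).
rows = (
    [{"segment": "returning", "spend": s} for s in (50, 55, 60, 65)]
    + [{"segment": "new", "spend": s} for s in (20, None, None, None)]
)

def share_of(rows, segment):
    """Fraction of rows belonging to the given segment."""
    return sum(1 for row in rows if row["segment"] == segment) / len(rows)

complete = [row for row in rows if row["spend"] is not None]

print(share_of(rows, "new"))      # share of "new" before dropping rows
print(share_of(complete, "new"))  # share of "new" after dropping rows
```

In this toy data the "new" segment falls from half of the examples to a fifth, so a model trained on the complete rows alone would see a different population than the one it will be used on.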

For filling in values that are MCAR or MAR, there are a few techniques:

  1. Imputation by constant. All of the missing values of a feature will be filled in with the same value (say 0, or the mean, median, or most frequently occurring value of the observed data) and it won’t be dependent on any other feature.
  2. Imputation by mean values pertaining to a specific value of another feature. Calculate the average separately for each group defined by that other feature. For example, use the gender feature to compute the mean age for each gender separately, and fill in each missing age with the mean of its group.
  3. Imputation by k-nearest neighbors. Using a distance metric over any number of other features, find the nearest neighbor of the data point whose value is to be filled, and fill it in based on the value of the neighbor. If more than one neighbor is considered, you can take either a mean of the values of the neighbors or a weighted mean based on the distance of each neighbor.
  4. Imputation by deterministic regression. Use a regression model based on the observed features and then predict the value for the unobserved feature. The predicted value is now what we use to fill in the missing numbers.
  5. Imputation by stochastic regression. A variation of deterministic regression is to add a small random value to the predicted value and use that to replace the missing value.
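The first three techniques in the list above can be sketched with the standard library alone. The rows, the feature names, and the choice of `k=2` are assumptions made for illustration; the kNN version uses an unweighted mean and Euclidean distance, which is only one of several reasonable choices.

```python
import statistics

# Made-up rows; None marks the missing age to fill in.
rows = [
    {"gender": "F", "height": 165, "age": 30},
    {"gender": "F", "height": 170, "age": 34},
    {"gender": "M", "height": 180, "age": 40},
    {"gender": "M", "height": 178, "age": 44},
    {"gender": "F", "height": 168, "age": None},
]

def observed(rows, field):
    return [row for row in rows if row[field] is not None]

def impute_constant(rows, field):
    """Fill every gap with one value: here, the mean of the observed values."""
    fill = statistics.fmean([row[field] for row in observed(rows, field)])
    return [dict(row, **{field: fill}) if row[field] is None else row for row in rows]

def impute_group_mean(rows, field, group_field):
    """Fill each gap with the mean of its own group (assumes the group has data)."""
    groups = {}
    for row in observed(rows, field):
        groups.setdefault(row[group_field], []).append(row[field])
    means = {group: statistics.fmean(values) for group, values in groups.items()}
    return [
        dict(row, **{field: means[row[group_field]]}) if row[field] is None else row
        for row in rows
    ]

def impute_knn(rows, field, distance_fields, k=2):
    """Fill each gap with the mean of its k nearest complete neighbors."""
    donors = observed(rows, field)

    def distance(a, b):
        return sum((a[f] - b[f]) ** 2 for f in distance_fields) ** 0.5

    filled = []
    for row in rows:
        if row[field] is None:
            nearest = sorted(donors, key=lambda donor: distance(row, donor))[:k]
            value = statistics.fmean([donor[field] for donor in nearest])
            row = dict(row, **{field: value})
        filled.append(row)
    return filled

print(impute_constant(rows, "age")[-1]["age"])
print(impute_group_mean(rows, "age", "gender")[-1]["age"])
print(impute_knn(rows, "age", ["height"])[-1]["age"])
```

Each function returns new rows and leaves the input untouched, which makes it easy to compare the effect of different imputation strategies on the same data.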

What method you use to fill in missing values depends on the reasons the data is missing as well as the characteristics of the data itself. Often how you do the data imputation has a significant effect on the performance of your model.
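The two regression-based techniques from the list above differ only in whether noise is added to the prediction. The sketch below uses a one-feature least-squares line; the data, the feature names, and the noise standard deviation are hypothetical, and a real model would typically use more features.

```python
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def impute_regression(rows, target, predictor, noise_sd=0.0, rng=None):
    """Fill gaps in `target` with a regression prediction from `predictor`.

    noise_sd == 0 gives deterministic regression imputation;
    noise_sd > 0 adds Gaussian noise (stochastic regression imputation).
    """
    donors = [row for row in rows if row[target] is not None]
    slope, intercept = fit_line(
        [row[predictor] for row in donors], [row[target] for row in donors]
    )
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    filled = []
    for row in rows:
        if row[target] is None:
            value = intercept + slope * row[predictor]
            if noise_sd > 0:
                value += rng.gauss(0, noise_sd)
            row = dict(row, **{target: value})
        filled.append(row)
    return filled

# Made-up rows; salary is in thousands and missing for the last example.
rows = [
    {"years_experience": 1, "salary": 40.0},
    {"years_experience": 3, "salary": 50.0},
    {"years_experience": 5, "salary": 60.0},
    {"years_experience": 4, "salary": None},
]

deterministic = impute_regression(rows, "salary", "years_experience")[-1]["salary"]
stochastic = impute_regression(rows, "salary", "years_experience", noise_sd=2.0)[-1]["salary"]
print(deterministic)
print(stochastic)
```

Adding noise keeps the imputed values from lying exactly on the regression line, which would otherwise make the filled-in feature look more predictable than it really is.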

For more on Prepare Your Data for Machine Learning Success, please refer to the wonderful course here

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at
