Bad Data in Machine Learning

There are many ways that data can go wrong, sometimes through no fault of its own.

Imbalanced Data

A dataset with skewed class proportions where the vast majority of your examples come from one class is called an imbalanced dataset. Not surprisingly, having imbalanced classes in your learning data impacts the model that results. You should first consider simply gathering more data. It’s possible that the imbalance can be addressed with a larger dataset. This is especially true if the imbalance isn’t inherent in the problem (like classifying rare events), but is an artifact of your data collection.

Sometimes you can combine domain knowledge with data to improve classifier performance. In addition, you might want to try different learning algorithms, as different algorithms may be more or less suited to handling class imbalance. For example, decision trees will often perform well on imbalanced datasets, while others assume an even distribution.

Resampling

We can also try resampling techniques to handle imbalanced data sets.

Oversampling	Add more copies (replicated randomly) of the minority class. Depending on the nature of your data, it might make sense to add a small amount of noise to the copies. Split data points into test and training sets *before* doing the oversampling; otherwise copies of the same points can end up in both sets, which leads to overfitting and poor generalization performance.
Undersampling	Randomly remove some observations of the majority class. The downside is that we’re removing information that could have been valuable, which may lead to underfitting and consequently poor generalization.

After applying such resampling techniques, you’ll have about the same number of data points for each class. Realize that using different sampling techniques can introduce bias into the data. After all, you’re deliberately changing the distribution of the data. Sometimes this is a good thing. Sometimes it isn’t. That’s why a clean test set and solid domain knowledge are so important when deciding what technique to try.
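The two resampling techniques above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`split_train_test`, `oversample`, `undersample`) and the toy data are assumptions made for the example. Note that the split happens first and only the training portion is resampled.

```python
import random
from collections import Counter


def split_train_test(rows, n_test, seed=0):
    """Shuffle the rows and hold out n_test of them as a clean test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def group_by_label(rows):
    """Group (features, label) rows by their label."""
    groups = {}
    for features, label in rows:
        groups.setdefault(label, []).append((features, label))
    return groups


def oversample(rows, seed=0):
    """Replicate random minority rows until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choice(g) for _ in range(target - len(g)))
    return balanced


def undersample(rows, seed=0):
    """Keep a random subset of each class, sized to match the smallest."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    target = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, target))
    return balanced


# Hypothetical toy data: 200 majority rows (label 0), 40 minority rows (label 1).
rows = [((i,), 0) for i in range(200)] + [((i,), 1) for i in range(200, 240)]

# Split first, then resample only the training portion.
train, test = split_train_test(rows, n_test=24)
train_over = oversample(train)
train_under = undersample(train)

print(Counter(label for _, label in train))
print(Counter(label for _, label in train_over))
print(Counter(label for _, label in train_under))
```

The test set keeps its original, imbalanced distribution, so it still reflects what the model will see in operation.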

Rare Events

In some classification problems, such as medical diagnosis, the imbalance is inherent in the problem: for instance, 99% of the people in a dataset are healthy while 1% have some disease. Imagine building a classification model that tries to predict whether or not someone has the disease. We can easily build a highly accurate classifier that always guesses no disease. This model is right on 99% of the data points, so it is 99% accurate and 0% useful. Plain accuracy is not the best metric to use here.

Precision and Recall

One way you can handle the imbalanced classes problem is to change your evaluation metric. Precision, recall, the confusion matrix, and the F1 measures are all useful metrics in cases where classes aren’t evenly distributed.
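To make the difference between these metrics and plain accuracy concrete, here is a small sketch using only the standard library. The data is made up for illustration: 1,000 people of whom 10 are sick, one classifier that always says "healthy", and a hypothetical screening model that finds 8 of the 10 sick people at the price of 20 false alarms.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1; each is reported as 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# 10 sick people (label 1) followed by 990 healthy people (label 0).
y_true = [1] * 10 + [0] * 990

always_healthy = [0] * 1000
# Hypothetical screening model: 8 sick found, 2 missed, 20 false alarms.
screening = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

for name, y_pred in [("always healthy", always_healthy), ("screening", screening)]:
    print(name, accuracy(y_true, y_pred), precision_recall_f1(y_true, y_pred))
```

The always-healthy classifier wins on accuracy (0.99 against 0.978) but has zero recall, while the screening model trades a little accuracy for a recall of 0.8.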

You can also come up with a cost matrix or loss function that’s a weighted combination of false positive and false negative errors with different weightings assigned to each type. Then you select the best classifier as the one that minimizes that cost matrix or loss function.
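As a rough sketch of the weighted-cost idea, the snippet below scores two classifiers by a weighted sum of their false positives and false negatives and picks the cheaper one. The classifier names, error counts, and cost values are all hypothetical; in practice the weights come from domain knowledge.

```python
def total_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Weighted combination of the two error types."""
    return cost_fp * false_positives + cost_fn * false_negatives


def pick_cheapest(candidates, cost_fp, cost_fn):
    """Return the name of the classifier with the lowest total cost."""
    def cost_of(name):
        fp, fn = candidates[name]
        return total_cost(fp, fn, cost_fp, cost_fn)
    return min(candidates, key=cost_of)


# Hypothetical error counts as (false positives, false negatives).
candidates = {
    "always_healthy": (0, 10),
    "screening_model": (20, 2),
}

# With equal costs the trivial classifier looks best ...
print(pick_cheapest(candidates, cost_fp=1, cost_fn=1))
# ... but not once a missed disease is judged 50 times worse than a false alarm.
print(pick_cheapest(candidates, cost_fp=1, cost_fn=50))
```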

Cohen’s Kappa

You can also try to optimize something called Cohen’s Kappa. This measure adjusts for class imbalance by comparing the observed accuracy with the accuracy expected by chance given the class proportions, where p_o is the observed accuracy and p_e is the expected accuracy.

κ = (p_o - p_e) / (1 - p_e) = 1 - (1 - p_o) / (1 - p_e)
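A small worked example of the formula, written as a sketch. It assumes the usual definition of expected accuracy (for each class, multiply the fraction of true labels by the fraction of predicted labels, then sum), and the data is the same kind of made-up 1%-disease set: 10 sick people out of 1,000.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    # Accuracy expected if predictions were independent of the true labels.
    p_e = sum((true_counts[c] / n) * (pred_counts[c] / n) for c in classes)
    if p_e == 1.0:
        return 0.0  # degenerate case: only one class present everywhere
    return (p_o - p_e) / (1 - p_e)


y_true = [1] * 10 + [0] * 990
always_healthy = [0] * 1000
# Hypothetical model: 8 sick found, 2 missed, 20 false alarms.
screening = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

print(cohen_kappa(y_true, always_healthy))
print(cohen_kappa(y_true, screening))
```

The always-healthy classifier gets a kappa of 0 despite its 99% accuracy, because that accuracy is exactly what chance agreement would give. The screening model scores roughly 0.41.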

ROC Curve

ROC stands for Receiver Operating Characteristic. ROC curves are another useful evaluation tool. Much as precision and recall do, they split accuracy into two parts, here sensitivity and specificity, and models can be chosen based on the balance between these values at different thresholds. An ROC plot is a two-dimensional plot with the misclassification rate (false positive rate) of one class on the x-axis, and the accuracy (true positive rate) of the other class on the y-axis.

True Positive Rate (y-axis)
  = True Positive / (True Positive + False Negative)

False Positive Rate (x-axis)
  = False Positive / (False Positive + True Negative)
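The two rates above can be turned into ROC points by sweeping a decision threshold over a classifier's scores. The sketch below does this by hand, assuming labels are 0/1 and that a higher score means "more likely positive". The scores are invented for illustration, and the area under the curve is computed with the trapezoid rule as one common summary.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print("AUC:", area_under_curve(points))
```

A curve that hugs the top-left corner has an area close to 1, while a classifier that guesses at random stays near the diagonal with an area around 0.5.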

An ROC plot not only preserves all performance-related information about a classifier, it also allows key relationships between the performance of several classifiers to be identified instantly by visual inspection. For instance, suppose there are 2 classifiers C₁ and C₂:

If C₁ has better accuracy, its ROC plot will be above and to the left of the plot of C₂.
If C₁ is superior to C₂ in some circumstances but not others, their ROC plots will cross.

If interpreted correctly, ROC plots show the misclassification cost of a classifier over all possible class distributions and all possible assignments of misclassification costs.

Cost Curve

Cost curves are an alternative to ROC curves for visualizing the performance of binary classifiers. While cost curves share many of ROC curves’ desirable properties, they can also:

show confidence intervals on a classifier’s performance
visualize the statistical significance of the difference in the performance of two classifiers

Generalization

Machine learning is the process of generalizing from examples. That generalization is actually very limited. How well the machine learning model generalizes has more to do with thorough testing than the computer actually knowing anything. More formally, generalization is limited by two things:

The data you feed into the system
The learning algorithm itself

Generalization is a difficult thing for machines to do. This is one of the ways machine learning differs from human learning.

Biased Data

Biased training data leads to biased models and predictions. If the learning data does not accurately represent the operational data that you want your model to give you predictions about, the predictions will be biased.

You need to be mindful of representation bias in your training data and pay attention to collecting data from under-represented groups. This concern is not unique to machine learning, but it’s one that must be a top priority in your data collection process.

Bias and Variance of Model

In almost all cases, our models are going to make mistakes when they’re predicting labels. Errors arise due to bias and variance. The bias-variance tradeoff to minimize errors is a fundamental property of learning.

Remember, we can’t consider all possible hypotheses; we have to start with a set of hypotheses of a particular type. This makes the final model biased, as no choice from this space will perfectly describe the real world. The bias of the model stems from our choice of hypothesis space via the learning algorithm and features of our data.

If our initial hypothesis space only includes models that are really simple, we will have underfitting, that is our choice of model can’t fit the training data well enough. If the model doesn’t fit the training data, it’s unlikely to do well on operational data (the stuff that model hasn’t seen). On the other hand, if we allow an excessively complicated model, the model could fit the training data really well, but perform poorly on operational data.

Underfitting	A machine learning algorithm cannot capture the underlying trend of the data well enough because the model is too simple.
Overfitting	The model learns the noise along with the signal in the training data, and so it won’t perform well on operational data.

One way to avoid overfitting is to simply refuse to allow complex models. In other words, restrict the hypothesis space to simple models.

If we have an overly simple hypothesis space, we’re avoiding overfitting, but introducing more potential for bias. But we can’t completely avoid bias by allowing infinite complexity, because that leads to errors due to overfitting.

Variance of Model

This is where variance (how much the models vary) comes in. Are the models (produced by our learning algorithm on our training data with the given features) actually consistent or do they change a lot with relatively small changes in training data?

With more complex models, we can potentially avoid bias. But the more complex the model, the more likely our approach will have high variance. What creates that variance?

Input data	Suppose we repeat the whole model building process more than once. The sampled data will be different each time due to randomness. The resulting models will almost always be different for a particular input. A set of highly varying models will give inconsistent predictions.
Learning algorithm	For parameterized models, the initialization of the parameters is usually done randomly. Initialization of parameters can lead to drastically different models even on the same dataset. Different classes of learning algorithms will be sensitive to initialization to varying degrees.

We measure bias in our models by looking at the performance on our learning data. Algorithms with high bias will underfit the training data as well as perform poorly on the test or validation data.

perform poorly on both training and test / validation data → high bias

We can measure variance by looking at how much the output changes when you repeat the learning process on a different sample of learning data. You can also take advantage of the relationship between complexity and variance and use tricks that encourage simpler models.
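The "repeat the learning process on a different sample" idea can be sketched as a small experiment. Everything here is an assumption made for illustration: the data comes from a synthetic process (a sine curve plus Gaussian noise), the simple model just predicts the mean label, and the complex model is a 1-nearest-neighbor lookup. The variance is measured on the predictions each model makes for one fixed query point across many freshly drawn training sets.

```python
import math
import random
import statistics


def sample_training_set(rng, n=50):
    """Draw a fresh training set from a synthetic noisy process."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.5)) for x in xs]


def fit_mean(train):
    """Very simple model: always predict the mean training label."""
    mean_y = statistics.mean([y for _, y in train])
    return lambda x: mean_y


def fit_nearest_neighbor(train):
    """Very flexible model: copy the label of the closest training point."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


def prediction_variance(fit, query_x, repeats=200, seed=0):
    """Variance of the prediction at query_x across repeated training runs."""
    rng = random.Random(seed)
    predictions = [fit(sample_training_set(rng))(query_x) for _ in range(repeats)]
    return statistics.pvariance(predictions)


var_simple = prediction_variance(fit_mean, query_x=5.0)
var_complex = prediction_variance(fit_nearest_neighbor, query_x=5.0)
print("mean model variance:      ", var_simple)
print("nearest-neighbor variance:", var_complex)
```

In this setup the nearest-neighbor model's predictions swing with the noise of whichever point happens to be closest, so its variance comes out much larger than that of the mean model, which averages the noise away but cannot follow the sine curve at all (high bias).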

Regularization is a systematic way of making the trade-off between bias and variance of your machine learning models in order to have better generalization on new unseen data.
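As one concrete illustration, and only an assumed one since no specific method is named here, consider an L2 (ridge) penalty on a one-feature linear model without an intercept. Minimizing sum((y - w*x)^2) + penalty * w^2 has the closed-form solution w = sum(x*y) / (sum(x*x) + penalty), so a larger penalty pulls the weight toward zero and yields a simpler, less data-sensitive model.

```python
def fit_ridge_slope(xs, ys, penalty):
    """Closed-form minimizer of sum((y - w*x)^2) + penalty * w^2 for one weight w."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + penalty
    return numerator / denominator


# Hypothetical data roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for penalty in [0.0, 10.0, 100.0]:
    print(f"penalty={penalty:>5}: w={fit_ridge_slope(xs, ys, penalty):.3f}")
```

With no penalty the weight is the ordinary least-squares fit. Increasing the penalty trades a little extra bias for lower variance, and the right amount is usually chosen by checking performance on validation data.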

Outliers

Outliers are data points that look odd or lie outside the normal range of the data.

A global outlier	A data point that looks strange overall, stands out in comparison to all the points.
A local outlier	A data point that fits in the data overall, but looks strange next to its neighbors.

Outliers can arise for multiple reasons:

the inherent variability of the data
- don’t really cause problems
- oddity goes away when you increase the number of samples
errors in the process of gathering data
- need to be removed
- can cause trouble
novelties, where something rare or new has actually happened
- should not be removed
- if removed, information is lost

Including outliers (due to errors) will degrade the performance of any model, since it’s learning from bad data. In that case, detecting and removing them can be an important data cleaning step. However, the decision to remove outliers is challenging because you usually can’t be sure that outliers are:

due to a problem during data collection, and
not due to intrinsic variability or novelties in the phenomenon that we’re modeling.

Consider the fact that when an outlier is caused by an unusual event, the outlier represents something significant. It’s representing a rare event which contains new information exactly due to its rarity. Domain knowledge can help identify absolutely impossible values. A visualization method like a boxplot (for a single feature) or scatter plot (2 or 3 features against each other) can let you easily spot outliers.

z-score

Another obvious tool for detecting outliers is looking at the z-score, which tells you how far out of whack a number is. If we assume that data comes from a normal or Gaussian distribution, the z-score for any point x is:

z = (x - sample mean) / sample standard deviation

The z-scores let you know when things are deviating from the standard. A common rule of thumb is to treat a point as an outlier when the absolute value of its z-score is about 2.5 or higher. The z-score is a simple yet powerful method for detecting outliers in your data, especially if you’re dealing with not too many features and you suspect the data follows a parametric distribution.
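A minimal sketch of z-score outlier detection for a single feature, using the standard library. The readings are invented, and the 2.5 cutoff is the rule of thumb from the text rather than a universal constant.

```python
import statistics


def z_scores(values):
    """z = (x - sample mean) / sample standard deviation, for every x."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def find_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score reaches the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) >= threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]

print(find_outliers(readings))
```

One caveat: the outlier itself inflates both the mean and the standard deviation, so in very small samples even an extreme value cannot reach a large z-score. Here the value 50 ends up at about 2.84.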

For non-parametric problems, you can also use DBSCAN (density-based spatial clustering of applications with noise), which is a clustering algorithm, or isolation forests, which are based on binary decision trees.

Skewed Distribution

Training data distribution has to match your operational data distribution, otherwise, you’re in trouble. But in practice, not only the data itself, but also the collection and processing of the data can meddle with the distribution.

A skewed label distribution is usually due to working with a minority class (imbalanced data). Most algorithms make an internal assumption about the distribution of the labels. This means that the labeling of your dataset needs to match the assumption. And you should pick the regression type that best matches the label set. For instance, you may choose a Poisson regression over a Gaussian regression, if you have reason to believe that the Poisson distribution is more likely to be the data generating distribution underlying the task you’re working on.

More often when we talk about skewed distributions in machine learning, we’re concerned with the feature space.

The distributional shift in the feature space from your training data to your testing data puts you into a different learning regime. You’re making predictions and asking questions about a distribution that’s different than the one you learned on. These contextual differences can mean the distribution of data is changed.

When the distributions across features vary wildly in terms of magnitudes, units, and range, this also disrupts machine learning performance. Techniques that standardize or normalize the features correct these mismatches between features.
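The two rescaling techniques can be sketched in a few lines. The feature names and values below are hypothetical. Standardization here means subtracting the mean and dividing by the standard deviation, and normalization means min-max scaling to the range [0, 1].

```python
import statistics


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Two hypothetical features on very different scales.
incomes = [32000.0, 45000.0, 51000.0, 78000.0, 120000.0]
ages = [23.0, 35.0, 41.0, 52.0, 64.0]

print(standardize(incomes))
print(standardize(ages))
print(min_max_normalize(incomes))
print(min_max_normalize(ages))
```

After either transformation the two features live on comparable scales. Both are linear rescalings, so they change magnitude and range but leave the shape of each feature's distribution unchanged.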

Individual features can have skewed distributions as well. The specific feature values might be distributed asymmetrically around the mean. This may or may not impact performance, depending on which algorithm you’re using. Various data transformations, like the logarithmic transform or polynomial feature expansions can correct for skew in individual features.
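To illustrate how a logarithmic transform can reduce skew in a single feature, the sketch below computes the (population) skewness of a made-up, right-skewed feature before and after applying `math.log1p`. The skewness formula used is the third central moment divided by the cubed standard deviation, which is one common definition.

```python
import math
import statistics


def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    third_moment = sum((v - mean) ** 3 for v in values) / len(values)
    return third_moment / std ** 3


# Hypothetical right-skewed feature: most values are small, a few are huge.
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log1p(v) for v in raw]

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

The transformed feature is still somewhat right-skewed, but much less so, because the logarithm compresses the long tail more than it compresses the small values.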

Moreover, if a particular feature has a lot of missing values in the data, we’re likely going to introduce skew. Since we usually don’t know the real distribution, techniques for filling in missing values will likely skew the distribution.

You can identify skewed distributions by visualizing your features. Visualization lets you see how skewed your features are, and if that skew is significant. If the skew is significant, you can use a feature scaling approach like standardization or normalization to manipulate the feature distribution. Standardization and normalization work in the majority of cases. If these scaling methods don’t work, you may have to do something trickier, like change the feature representation.

Sometimes you won’t have a way of making the feature distribution well-behaved. The next best thing in that case is to choose algorithms that are less sensitive to feature distributions. You have two options:

Make an informed decision, using what you know about the distribution to select the algorithm that suits it best.
Empirically test different algorithms. (In machine learning, this approach is often the more favored one.)

You should always care about the underlying distribution. Especially when your data set is small and you’re trying to make concrete inferences about new data, understanding the data distribution can really help.

Consequence of Bad Data

If we don’t fix the issues we encounter in data, the issues propagate. We’re less likely to have success in building the model and less likely to see it successfully operate.

Correct and high quality labels (class labels or real number targets) are key to training a good supervised learning model. Wrong labels will obviously mislead the learning algorithm, and can even break it entirely. On the other hand, we have to be aware of situations where the labels are misleading. Wrong labels or misleading labels can be major hindrances in training a good quality model. Outlier labels are another example of how small errors early on become huge issues later.

Live data can be different from training data in spite of your best efforts. There are some unavoidable differences between live data and learning data.

All pre-processing of live data has to be automated.
The live data is coming in on its own schedule.
The live data does not arrive as one big batch.

This leads to a couple of other things to consider regarding your data pipeline. People have a tendency to change things without thinking about the larger impact of their changes. You want to systematically detect such changes. So you have to have some check built-in so that you know that your data transformation is still accurately representing the data that flowed in originally.
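One possible shape for such a built-in check is sketched below. It is a crude heuristic with hypothetical names and thresholds, not a prescribed method: summary statistics are stored for a feature at training time, and every incoming live batch is compared against them before it reaches the model.

```python
import statistics


def summarize(values):
    """Reference statistics captured from the training data."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def check_batch(batch, reference, max_shift=3.0):
    """Return a list of human-readable issues found in a live batch."""
    if not batch:
        return ["empty batch"]
    issues = []
    # How far the batch mean moved, in units of the training standard deviation.
    shift = abs(statistics.mean(batch) - reference["mean"]) / reference["stdev"]
    if shift > max_shift:
        issues.append(f"mean shifted by {shift:.1f} training standard deviations")
    out_of_range = [v for v in batch if v < reference["min"] or v > reference["max"]]
    if out_of_range:
        issues.append(f"{len(out_of_range)} value(s) outside the training range")
    return issues


# Hypothetical temperature feature, recorded in Celsius during training.
reference = summarize([20.1, 21.4, 19.8, 22.0, 20.7, 21.1])

print(check_batch([20.5, 21.0, 20.9], reference))  # looks like training data
print(check_batch([68.9, 69.8, 69.6], reference))  # someone switched to Fahrenheit
```

The second batch triggers both checks, which is the kind of silent upstream change (here, a unit switch) that is much easier to catch at the pipeline boundary than in the model's predictions.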

My Certificate

For more on Bad Data in Machine Learning, please refer to the wonderful course here https://www.coursera.org/learn/data-machine-learning

My #114 certificate from Coursera

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai