Having a deep understanding of data is an essential prerequisite for doing EDA (Exploratory Data Analysis) as well as feature engineering. Very often, only certain types of feature engineering techniques are valid for certain types of data. Why do we need feature engineering? To visualize data better and to use it as features in machine learning models, you may need to encode it differently.
Algorithms often require your data to be in a certain form for them to be able to learn effectively. Most machine learning algorithms make underlying assumptions about the data, and they work best when those assumptions hold. Understanding these assumptions and transforming the data accordingly is the key to building an optimal machine learning model.
Binning is one of the simplest feature engineering techniques. It transforms continuous numerical values into discrete values or categories. The easiest kind of binning simply buckets values according to some fixed criteria, such as making buckets of the same size or using standard domain-specific thresholds. This approach is useful when you know the fine details aren’t really important and the learning algorithm will treat a range of values similarly.
Since binning leads to discrete categorical features, we probably want an additional step of feature engineering on those categorical features, like one-hot encoding. Alternatively, we can treat the bins as ordinal features and assign them integer values accordingly.
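As a rough illustration of binning followed by encoding, here is a minimal sketch using pandas. The ages, thresholds and bin labels are made up for the example and are not from the course.

```python
import pandas as pd

# Hypothetical ages; the thresholds and labels below are illustrative only.
ages = pd.Series([3, 17, 25, 42, 68, 90], name="age")

# Fixed, domain-style thresholds: (0, 18], (18, 65], (65, 120]
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# Option 1: one-hot encode the bins as unordered categories.
one_hot = pd.get_dummies(age_group, prefix="age")

# Option 2: treat the bins as ordinal and use their integer codes (0, 1, 2).
ordinal = age_group.cat.codes
```

Which option fits better probably depends on whether the downstream model can make use of the ordering between bins.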
We can also apply transformations and scaling techniques to deal with skewed data or with features on very different scales. Examples are log transformation, z-score normalization, and min-max scaling.
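The three techniques named above can be sketched with NumPy as follows. The input values are invented, and `log1p` is just one common choice of log transformation.

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # made-up, right-skewed values

log_x = np.log1p(x)                             # log transformation: compresses the long right tail
z_score = (x - x.mean()) / x.std()              # z-score: mean 0, standard deviation 1
min_max = (x - x.min()) / (x.max() - x.min())   # min-max: rescaled to the range [0, 1]
```

Note that z-score and min-max scaling are linear, so they change the scale of a feature but not the shape of its distribution. Of the three, only the log transformation actually reduces skew.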
Useful and Useless Features
One of the best ways to identify which features could be useful is to talk to a domain expert. But there’s a difference between features that give a good performance boost to a machine learning model and features that are useful for human experts, so you aren’t required to rely solely on what domain experts say.
Humans identify features that make sense given what we think about the context of the problem and the underlying processes. Machines don’t discriminate in this way. Machine learning can identify real correlations that hold predictive power but that aren’t useful for the ways we humans understand and interact with the world. We just approach the problem in different ways.
Machines and humans are different, which is both good and bad:
- Sometimes machine learning creates nonsense outcomes from label leakage or random effects.
- Sometimes it picks up on artifacts of the training data that hurt operational performance.
- Sometimes machine learning identifies significant correlations that aren’t related to the context or processes that haven’t been considered before.
- Sometimes it provides interesting insights for domain experts.
- Sometimes those predictors shouldn’t be taken into account by domain experts because different signals are more accessible or easier to work with.
When it comes to assessing the usefulness of a feature, a simple thing to check is how complete it is, in other words what percentage of data is missing for that particular feature. If the percentage of missing values is really high, the feature is unlikely to be useful even with the best data imputation techniques.
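A completeness check along these lines might look like the sketch below. The data frame is invented, and the 50% cutoff is an arbitrary assumption rather than a recommended value.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [np.nan, np.nan, np.nan, 52000],
    "city": ["Oslo", "Lima", None, "Pune"],
})

# Percentage of missing values per feature.
missing_pct = df.isna().mean() * 100

# Flag features whose missing share exceeds an (assumed) 50% cutoff.
too_sparse = missing_pct[missing_pct > 50].index.tolist()
```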
Depending on how the feature is constructed or recorded, it might get stale. You need to make sure that features keep their value over time. We should also make sure that the meaning of a feature’s values remains constant over time.
We should always check the variance of features as part of identification of useful features. Intuitively variance measures how far the values of a feature spread out from their average value. You can literally calculate the variance in the statistical sense by calling the appropriate function. You can also compare the number of unique values to the total number of examples for features drawn from a discrete set. This can be more informative than the statistical variance.
|If the variance is 0||Most likely the feature is constant, which doesn’t add any value to the modeling process.|
|If the variance is near 0 or very low||It’s reasonable to think the feature is useless, but beware that changes can be rare yet still extremely useful.|
When looking at the unique values, it can be a good idea to check the ratio between the most frequent value and the second most frequent value against the percentage of unique values.
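The variance and unique-value checks described above could be computed roughly like this. The column names and values are invented, and `freq_ratio` is a hypothetical helper written for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1, 1, 1],
    "rare_change": [0, 0, 0, 0, 0, 0, 0, 1],
    "varied": [3, 7, 1, 9, 4, 6, 2, 8],
})

variance = df.var()                          # statistical (sample) variance per feature
unique_pct = df.nunique() / len(df) * 100    # unique values as a percentage of examples

def freq_ratio(col):
    """Ratio of the most frequent value's count to the second most frequent one's."""
    counts = col.value_counts()              # sorted from most to least frequent
    return counts.iloc[0] / counts.iloc[1] if len(counts) > 1 else float("inf")

frequency_ratio = df.apply(freq_ratio)
```

Here `rare_change` has a low variance and a frequency ratio of 7, which is the kind of feature that deserves a closer look before it is discarded.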
Correlated features are another area of interest. Feature correlation measures the linear relationship between two variables, that is, the extent to which the two variables move together:
- a positive correlation means the two variables increase or decrease together,
- a negative correlation means that when one increases the other decreases and vice versa.
Given a set of features, correlation can be calculated for each feature with every other. If any two features are highly correlated with each other (either negatively or positively), it might not be useful to keep them both.
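A minimal sketch of computing pairwise correlations and listing highly correlated pairs is shown below. The synthetic features and the 0.9 cutoff are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),   # nearly a linear copy of "a"
    "c": rng.normal(size=500),                      # unrelated to the others
})

# Absolute correlation, so strong negative correlations are caught too.
corr = df.corr().abs()

threshold = 0.9  # illustrative cutoff
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
```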
For linear learning algorithms, having correlated features can make the algorithm numerically unstable and unable to find optimal models. This is known as collinearity or multi-collinearity. A linear regression algorithm is really counting on some degree of correlation between the features and the target (the label), but multi-collinearity within the features themselves can spell trouble and even cause the learning algorithm to fail. To avoid these issues, it could be helpful to:
- transform features using Principal Component Analysis (PCA)
- perform feature selection, or
- use decision trees
Because linear algorithms want correlation with the label, you might be tempted to do a univariate assessment and throw away individual features that have low correlation with the target. But it’s important to understand that the features are going to be used in a multivariate setting. A feature might not have good univariate correlation with the label but still be useful for the model, because interactions between it and other features may have a significant impact on the prediction power of the final model. So it’s a good idea to test the usefulness of a feature in a multivariate setting.
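To make this concrete, here is a small synthetic sketch in which neither feature correlates with the label on its own, yet together they explain almost all of it. The data is invented, and the multivariate model uses an explicit interaction term, which is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 * x2 + 0.1 * rng.normal(size=n)   # label driven purely by the interaction

# Univariate view: each feature looks almost useless.
corr_x1 = np.corrcoef(x1, y)[0, 1]
corr_x2 = np.corrcoef(x2, y)[0, 1]

def r_squared(features, target):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    return 1 - (residual ** 2).sum() / ((target - target.mean()) ** 2).sum()

r2_separate = r_squared([x1, x2], y)               # features used only additively
r2_interaction = r_squared([x1, x2, x1 * x2], y)   # features allowed to interact
```

A univariate filter would likely have dropped both `x1` and `x2` here, even though a model that can use their interaction fits the label well.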
Complex nonlinear models that can exploit multivariate correlations are not always very explainable.
So usefulness is really in the context of the entire machine learning process from data collection through to the setting in which the model will be used. We might want to restrict ourselves to features that have clear and obvious meanings regardless of what will improve the prediction power, because of how we’re going to be using the model.
How Many Features are Needed
Real-world data is often very large and complex. We often have too many features or dimensions, and it is rarely a good idea to use all of them. As the dimensionality of the data increases, the volume of the space in which our data points can live increases as well. In fact, each new dimension multiplies the amount of space we’re trying to fill with our training data, so the volume grows exponentially with the number of dimensions.
If we don’t get a corresponding increase in the number of data points, we are going to have less and less coverage. In other words, as the dimensionality of our data increases, our coverage gets more and more sparse, and for most statistical or machine learning modeling techniques sparsity is a huge problem. The amount of data needed to maintain the same level of performance increases exponentially as we add feature dimensions. This phenomenon is popularly referred to as the curse of dimensionality.
Some algorithms are more affected by the curse of dimensionality than others. For instance, distance-based algorithms like kNN or k-means are greatly impacted, since adding dimensions to the data literally increases the distances between examples.
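The effect on distances can be seen in a quick simulation. This sketch assumes points drawn uniformly from the unit hypercube, and the point counts and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_distance(n_points, n_dims):
    """Average Euclidean distance between random points in the unit hypercube."""
    points = rng.uniform(size=(n_points, n_dims))
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(n_points, k=1)   # count each pair once, skip self-distances
    return dists[upper].mean()

# Same number of points, increasing number of dimensions.
distances = {d: mean_pairwise_distance(200, d) for d in (2, 10, 100)}
```

With the number of points held fixed, the average distance between examples keeps growing as dimensions are added, which is the sparsity described above.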
On the other hand, in Random Forest algorithms, individual trees look at subsets of features at a time. They can ignore the vastness of empty space, for better and for worse. This focus can make it easier to optimize for each tree. So they don’t feel the curse of dimensionality to the same extent as distance-based algorithms, but there’s no getting away from the fact that more features means exponentially more space to explore.
In a perfect world we could identify exactly the minimum feature set we need for the model to perform well. The machine learning model would then be less complex and easier to interpret. Domain knowledge is one of the most important factors in finding critical features that are most likely to impact the model performance.
Building Good Features
We could use unsupervised learning to transform the data into new representations, attempting to discover the hidden structures that exist in the data (without using any labels). Sometimes, the representations learned by these techniques may actually be more effective than hand-engineering your features.
One of the most common forms of unsupervised learning is clustering, where we discover clusters (groups) of data points that are more similar to each other than to data points from other groups. Clustering is the unsupervised analog of classification.
However, another form of unsupervised learning called representation learning might be more suitable to help prepare data for use in modelling. In representation learning we try to learn better features from the set of features that we have. This is usually the unsupervised analog of regression.
|Supervised learning||Unsupervised learning|
|Classification||Clustering|
|Regression||Representation learning (for instance, PCA)|
One common representation learning method is Principal Component Analysis (PCA). In PCA a new set of axes is learned, and we can then project our original data onto the new axes. The idea is that these new axes, the principal components, directly capture the most important variations in the data. In many cases, most of the variance in the dataset can be captured in a small number of components, which you can then use as features.
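As a hedged sketch of this idea, the snippet below uses scikit-learn's `PCA` on synthetic data. The data is built so that ten observed features are driven by two hidden factors, and the choice of two components is an assumption that matches that construction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))                        # two hidden factors
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))   # ten observed, correlated features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                  # data projected onto the learned axes
explained = pca.explained_variance_ratio_.sum()   # share of variance the two axes capture
```

On real data the right number of components is rarely known in advance, so `explained_variance_ratio_` is usually worth inspecting before fixing it.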
Another example of representation learning is to utilize the power of neural networks. Neural networks can help us learn new representations for our data. In fact, in multi-layer neural networks that do classification, regression, or really any kind of learning:
|all layers except for the very last one||representation learning|
|only the last layer||classification or regression or whatever|
One common explicit use of neural networks for representation learning is autoencoders. An autoencoder is a neural network that’s trained so that the output matches the input as closely as possible. We can force certain structures within the autoencoder. In particular, we can force the autoencoder network to find a small number of features with which to decompose and then recompose the input, rather than just copying it directly.
Input → Encoder → Decoder → Output = Input
|Encoder||The first half of the neural network, which does the decomposition: it encodes or transforms the data into the compressed form.|
|Decoder||Attempts to regenerate the original input from that compressed form.|
Once the autoencoder has been trained, the decoder portion can be removed.
After removing the decoder, the internal representations produced in the middle of the autoencoder can be passed as input features to the model. Not only does this give us a compressed representation of our original data, but it may also give us better features that make it easier to differentiate between different classes.
Input → Encoder → Model
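The encode-then-discard-the-decoder pattern can be sketched with scikit-learn's `MLPRegressor` trained to reproduce its own input. This is a stand-in for a dedicated deep learning framework, and the data, the two-unit bottleneck and the `encode` helper are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = np.tanh(latent @ rng.normal(size=(2, 8)))   # eight features driven by two hidden factors

# Train the network to reproduce its own input through a 2-unit bottleneck.
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X, X)

def encode(data):
    """Apply only the encoder half: the weights and biases feeding the bottleneck."""
    return np.tanh(data @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])

X_encoded = encode(X)   # compressed features to pass to a downstream model
```

With a single hidden layer this is a very shallow autoencoder. Deeper encoders and decoders are more typical in practice.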
When using unsupervised learning, the choice of the number of clusters (in clustering), components (in PCA), or nodes (in an autoencoder) should depend on your data. Choosing these values can have a significant impact on your results. It’s likely that such hyperparameters need to be optimized as part of your pipeline.
Feature Selection: Filter Methods
Feature selection and extraction are extremely important aspects of building a good predictive model. It all comes down to engineering features that capture hidden insights about the phenomenon you’re trying to predict, and making an informed choice of the variables that have the best impact on predicting the target variable. Selecting good features becomes even more important when the number of features you can choose from is very large. Selecting the right features can help:
- train the machine learning algorithm faster
- make your model less complex and easier to interpret
- reduce over-fitting
There are many filtering methods out there to do feature selection. These techniques are independent of the choice of machine learning algorithm and are generally used as a pre-processing step. Their goal is to select relevant feature subsets that should have the most impact on the model’s overall performance. Filter methods exploit the intrinsic properties of the features to build the subset that’s most likely to be useful.
Information Gain Filtering
Information gain between two random variables is a measure of the mutual dependence between the two variables: how much changing one variable affects the other or, more technically, the amount of information obtained about one variable by observing the other.
$$I(X;Y) = \int_X \int_Y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$$
The formula determines how similar the joint probability distribution $p(x, y)$ of two variables $x$ and $y$ is to the product of the factored marginal distributions $p(x)\,p(y)$. If $x$ and $y$ are completely independent, then $p(x, y)$ is equal to $p(x)\,p(y)$ and the information gain is 0. By calculating information gain, the features that contribute more to information gain on the target variable can be selected as relevant features.
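For discrete variables the integral becomes a sum over value pairs, which can be estimated from co-occurrence counts. The sketch below is one simple way to do that with NumPy. The function name and the synthetic features are invented for the example.

```python
import numpy as np

def mutual_information(x, y):
    """Discrete mutual information in nats, estimated from co-occurrence counts."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)        # contingency table of counts
    joint /= joint.sum()                       # p(x, y)
    px = joint.sum(axis=1, keepdims=True)      # p(x)
    py = joint.sum(axis=0, keepdims=True)      # p(y)
    nonzero = joint > 0                        # terms with p(x, y) = 0 contribute 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=2000)
flip = rng.random(2000) < 0.1
informative = np.where(flip, 1 - target, target)   # copy of the target with about 10% flipped
irrelevant = rng.integers(0, 2, size=2000)         # independent of the target

mi_informative = mutual_information(informative, target)
mi_irrelevant = mutual_information(irrelevant, target)
```

Ranking features by this score and keeping the top few is the basic shape of an information gain filter. Continuous features would need binning or a different estimator first.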
Pearson’s Correlation Coefficient
Pearson’s correlation coefficient is one of the techniques used to quantify the linear dependency between two variables X and Y. It is obtained by dividing the covariance of the two variables by the product of their standard deviations.
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
$\rho_{X,Y}$ has a value between 1 and -1, where:
- 1 is the maximum positive linear correlation
- 0 is no linear correlation
- -1 is the maximum negative linear correlation
In cases where two features are highly correlated we can probably drop one or select the one which has the highest correlation with the target variable.
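The definition translates almost directly into code. This is a minimal sketch with made-up data, where `pearson` is a name chosen for the example.

```python
import numpy as np

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))   # population covariance
    return cov / (x.std() * y.std())                 # population standard deviations

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rho_pos = pearson(x, 2 * x + 1)    # the two variables increase together
rho_neg = pearson(x, -3 * x)       # one increases as the other decreases
```

In practice `np.corrcoef` or `pandas.DataFrame.corr` compute the same quantity.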
Sometimes it’s useful to simply compute the variance of each feature and then use some threshold value to select a subset of features with variance greater than the threshold. We can sometimes assume that features with high variance have more useful information than others. A feature with absolutely no variance is a constant and tells us nothing.
One of the drawbacks of this however, is that it doesn’t really tell us much about any relationship between features or between a feature and the target variable.
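A variance threshold filter can be sketched in a few lines. The matrix and the 0.1 cutoff are illustrative assumptions.

```python
import numpy as np

# Columns: a constant feature, a rarely changing feature, a widely varying feature.
X = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.0, 20.0],
    [1.0, 1.0, 30.0],
    [1.0, 0.0, 40.0],
])

threshold = 0.1                  # illustrative cutoff
variances = X.var(axis=0)        # variance of each feature (column)
keep = variances > threshold     # boolean mask of features to keep
X_selected = X[:, keep]
```

Because variance depends on a feature's units, a single threshold is only meaningful when the features are on comparable scales.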
The chi-squared test is used with categorical features to assess, from their frequency distributions, how likely it is that two features are dependent.
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Here $O$ is an observed frequency and $E$ is the frequency expected if the two features were independent. In order to establish that two categorical features are dependent, the chi-squared statistic should be above a certain threshold. The threshold value increases as the number of classes within the feature increases.
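As a worked example, the statistic can be computed from a contingency table of observed counts, with the expected counts derived from the row and column totals. The two tables below are invented.

```python
import numpy as np

def chi_squared(observed):
    """Chi-squared statistic for a contingency table of observed counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()   # counts expected under independence
    return float(((observed - expected) ** 2 / expected).sum())

independent = chi_squared([[10, 20], [20, 40]])   # rows are proportional: statistic is 0
dependent = chi_squared([[30, 10], [10, 30]])     # expected count is 20 per cell: statistic is 20
```

For a 2×2 table (one degree of freedom) the usual 5% critical value is about 3.84, so the second table would count as dependent and the first would not.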
Feature Selection: Wrapper Methods
In contrast to filter methods, wrapper methods measure the relevance of features based on the model’s performance. These methods have the machine learning testing and evaluation process built in. They are methods that wrap around the machine learning process, hence the name wrapper methods.
Features → Feature Subsets → ML Algorithm → Measure Performance → back to Feature Subsets (the wrapper loop)
Wrapper methods directly optimize the model’s performance and are computationally more expensive than filter methods. The process works as follows:
- A subset of features is selected to train a model
- The model’s performance on those features is evaluated
- Iteratively, more features are added or removed
- The model is retrained and its performance is evaluated again
We use this iterative process, and the changes in performance between iterations with different features, to find a good feature subset. This is essentially a search problem: finding the subset of features that best optimizes the model’s performance.
Forward selection is one of the wrapper techniques. It works as follows:
- Start with no features
- Iteratively add the feature that best improves the performance of the machine learning model at each step
- The best feature to add in every iteration is determined by some criterion
- Keep adding features until either:
  - we hit a limit on the number of features, or
  - adding a new feature doesn’t improve the performance by more than our threshold value
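The forward selection steps above can be sketched as a greedy loop. This is a toy version under stated assumptions: synthetic data where only features 0 and 3 matter, a least squares model, held-out R² as the criterion, and made-up limits (`max_features`, `min_improvement`).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=400)   # only features 0 and 3 matter
train, valid = slice(0, 300), slice(300, 400)

def validation_r2(cols):
    """Fit least squares on the training rows, score R^2 on the held-out rows."""
    def design(rows):
        return np.column_stack([np.ones(len(y[rows]))] + [X[rows, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design(train), y[train], rcond=None)
    residual = y[valid] - design(valid) @ coef
    return 1 - (residual ** 2).sum() / ((y[valid] - y[valid].mean()) ** 2).sum()

selected, best_score = [], validation_r2([])   # start with no features
max_features, min_improvement = 4, 0.01        # illustrative stopping rules

while len(selected) < max_features:
    candidates = [c for c in range(X.shape[1]) if c not in selected]
    scores = {c: validation_r2(selected + [c]) for c in candidates}
    best = max(scores, key=scores.get)         # candidate that improves the score most
    if scores[best] - best_score < min_improvement:
        break                                  # no candidate helps enough: stop
    selected.append(best)
    best_score = scores[best]
```

Each round refits the model once per remaining candidate, which is where the extra computational cost of wrapper methods comes from.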
In contrast, the backward elimination method works as follows:
- Start with all the features
- Iteratively remove the least important feature, that is, the one whose removal least degrades the performance of the machine learning model, based on some predetermined significance level
- Stop when we are down to the limit on how many features we can have, or when the degradation becomes too large
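Backward elimination can be sketched the same way, starting from the full feature set. As before, the data, the least squares model, the held-out R² criterion and the limits (`min_features`, `max_drop`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=400)   # only features 0 and 3 matter
train, valid = slice(0, 300), slice(300, 400)

def validation_r2(cols):
    """Fit least squares on the training rows, score R^2 on the held-out rows."""
    def design(rows):
        return np.column_stack([np.ones(len(y[rows]))] + [X[rows, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design(train), y[train], rcond=None)
    residual = y[valid] - design(valid) @ coef
    return 1 - (residual ** 2).sum() / ((y[valid] - y[valid].mean()) ** 2).sum()

remaining = list(range(X.shape[1]))     # start with all the features
current = validation_r2(remaining)
min_features, max_drop = 1, 0.01        # illustrative stopping rules

while len(remaining) > min_features:
    # Score the model with each feature left out in turn.
    scores = {c: validation_r2([r for r in remaining if r != c]) for c in remaining}
    weakest = max(scores, key=scores.get)      # removing it hurts performance the least
    if current - scores[weakest] > max_drop:
        break                                  # degradation too large: stop
    remaining.remove(weakest)
    current = scores[weakest]
```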
In feature selection, techniques are used to create a subset of existing features based on their importance or influence in predicting the target variable. However, in feature extraction, the goal is to generate useful features from data that is in a format that’s difficult to analyze directly.
One example of feature extraction is representation learning, which is not domain specific. For instance:
- Principal Component Analysis is an unsupervised feature extraction technique that creates new features from linear combinations of the original features.
- Linear Discriminant Analysis is a supervised dimensionality reduction and feature extraction technique. LDA does not maximize the explained variance. Instead, it maximizes the separability between classes.
There are also feature extraction methods that are specific to the data or domain, for instance extracting a moving average feature from time series data.
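A moving average feature can be extracted with a simple convolution. The series and the window size here are made up.

```python
import numpy as np

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # toy time series
window = 3                                          # illustrative window size

# Each output value is the mean of `window` consecutive observations.
moving_average = np.convolve(series, np.ones(window) / window, mode="valid")
# -> array([2., 3., 4., 5.])
```

With `mode="valid"` the result is shorter than the input by `window - 1` values, so it needs to be aligned with the original timestamps before being used as a feature.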
Word embedding is a very good example of transfer learning. Word embeddings are a particular approach to representing textual data where the patterns of co-occurrences of words in the corpus or a defined chunk of text are used to construct a brand new representation or feature set.
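To illustrate the co-occurrence idea, here is a small count-based sketch: it builds a word-by-word co-occurrence matrix over a tiny invented corpus and factorizes it with SVD. This is only one simple way to obtain embeddings, not the method used by any particular library, and the window size and embedding dimension are arbitrary.

```python
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sent in tokens for word in sent})
index = {word: i for i, word in enumerate(vocab)}

# Count how often each pair of words appears within `window` positions of each other.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[index[word], index[sent[j]]] += 1

# Truncated SVD of the co-occurrence matrix: each row becomes a dense word vector.
U, S, _ = np.linalg.svd(cooc)
dim = 3
embeddings = U[:, :dim] * S[:dim]
```

Words that occur in similar contexts have similar co-occurrence rows, so they tend to end up with similar vectors.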
The word embeddings can be learned either as part of a supervised learning task or through unsupervised methods over some particular corpus. When word embeddings are learned on one task and those learned representations are then used for another task, it’s an example of transfer learning from a source task to a target task. The source task is one where we have lots of relevant data and that is similar enough to the target task that what we learn from it can be usefully applied to complete the target task.
Transfer learning can be categorized into:
|Transductive transfer learning||Similar||Different|
|Inductive transfer learning||Different||Similar|
|Unsupervised transfer learning||Different but similar enough||Different but similar enough|
Taking existing models (called pre-trained models) and using them for other tasks is also an example of transfer learning. The pre-trained models are usually trained on huge datasets, and then those learned representations are stored. One example is auto-encoders, whose hidden layers are used for representation learning.
Once we have our pre-trained representations, the next step then is to take those learned representations as a start and retrain on the available dataset for the new classification task, the target task. This training phase is called fine-tuning for a new dataset.
Transfer learning helps us to:
- get a head start on learning, so that rather than starting from scratch we carry over useful knowledge from the past.
- improve the asymptotic performance, that is, the performance reached once learning has leveled off and the relative change in performance is small.
However, transfer learning is notoriously difficult and doesn’t always work the way we’d like.
For more on Building Good Features for Machine Learning, please refer to the wonderful course here https://www.coursera.org/learn/data-machine-learning
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai