How to Select Features for Your Machine Learning Model?
Feature selection distinguishes the best data scientists from the others. Get techniques and tips for choosing the best features effectively.
When data scientists want to increase the performance of their machine learning models, feature selection is often their starting point. This process involves identifying the most relevant variables that help predict the target variable. By carefully choosing the right set of features, data scientists and machine learning practitioners can improve their models' efficiency, speed, and interpretability, leading to a notable increase in performance and accuracy.
This article will guide you through everything you need to know about picking the right features for your machine learning model, covering:
- What is feature selection?
- Feature selection techniques
- Filter methods
- Wrapper methods
- Embedded (intrinsic) methods
- Unsupervised methods
- Dimensionality reduction techniques
- Which method to choose?
- Practical tips for effective feature selection
- A step-by-step guide to the feature selection process
- Final thoughts
What is Feature Selection?
A favourite expression among professionals in statistics, data analysis, and data science was coined by George Fuechsel, an IBM programmer and instructor, in the early 1960s:
Garbage in, garbage out.
Indeed, a system can only learn effectively if the training data contains sufficient relevant features and minimal irrelevant ones. This is where the processes of feature engineering and feature selection come into play, serving to prepare a set of features that will contribute to building a high-performing and accurate model.
Aurélien Géron, in his book ‘Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow’, describes feature selection as part of the feature engineering process:
A critical part of the machine learning project is coming up with a good set of features to train on. This process, called feature engineering, involves three steps:
1. Feature selection (selecting the most useful features to train on among existing features)
2. Feature extraction (combining existing features …)
3. Creating new features by gathering new data.
Other sources differentiate feature selection from feature engineering as follows: feature engineering is about using domain knowledge to create new features from raw data that make machine learning algorithms work, while feature selection is about choosing the best subset of all available features for use in the model.
In essence, the feature selection process helps to limit the features to a manageable number, reducing the risk of overwhelming the algorithms or the people interpreting the model. It also decreases computation time by minimising the number of data transformations required.
Although the approach varies across techniques, the primary objective of feature selection is the same: to identify the features that most significantly impact the model's performance.
Feature Selection Techniques
There are several techniques for feature selection, each with its own approach and application context.
Feature selection methods can be categorised into two types:
- Supervised: These methods use the target variable to evaluate the relevance of features. The objective is to remove irrelevant variables.
- Unsupervised: These methods do not use the target variable and instead focus on the intrinsic structure and relationships within the data. The objective is to remove redundant variables.
Supervised methods for feature selection can be further divided into three categories:
- Filter Methods: Select subsets of features based on their statistical relationship with the target variable, without involving any machine learning algorithms.
- Wrapper Methods: Search for high-performing subsets of features by assessing the performance of a specific machine learning model with different combinations of features.
- Embedded (or Intrinsic) Methods: Feature selection happens naturally as part of the model training process, leveraging the model's own mechanisms to determine which features are most important.
Feature selection is also related to dimensionality reduction techniques, which are primarily used to reduce the number of input variables in a dataset by transforming the original features into a smaller set of new features, capturing most of the essential information with less data.
Let's explore the most commonly used methods in each group in more detail.
Filter Methods
Filter methods for feature selection evaluate the importance of features based on their statistical relationship with the target variable, independent of any machine learning model. These methods are called "filter" methods because they act as a filter, selecting only the most relevant features before the modelling stage begins.
In practice, filter methods are often used as a preliminary step in feature selection to quickly reduce the number of features to a more manageable size before applying more complex selection methods or building models.
Key steps:
- Choose a statistical measure based on the data types of the feature and the target variable.
- Calculate the statistical measure for every feature.
- Select the top-ranking features, or those with scores above a chosen threshold.
The choice of statistical measures for filter feature selection is highly dependent on the type of input (feature) and output (target variable). Common types of variables are numerical and categorical. Thus, we can divide statistical measures for filter feature selection into four groups (scenarios).
1. Numerical Input, Numerical Output
- Pearson's Correlation: Measures the linear correlation between two continuous variables.
- Spearman's Rank Correlation: A non-parametric measure for ordinal or continuous variables, assessing monotonic relationships.
- Kendall's Tau: A non-parametric measure that assesses ordinal associations, useful for continuous variables with ties.
- Distance Correlation (dCor): Captures both linear and non-linear relationships between two continuous variables.
- F-Test (in the context of regression): Evaluates the linear dependency between a continuous feature and a continuous target.
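As a quick illustration of this scenario, here is a minimal sketch (assuming pandas and scikit-learn are installed, and using the California housing data purely as an example) that ranks numerical features by their Pearson and Spearman correlation with a numerical target; the top-5 cutoff is an arbitrary choice.

```python
# Minimal sketch: rank numerical features by correlation with a numerical target.
# The dataset choice and the top-5 cutoff are illustrative assumptions.
import pandas as pd
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame  # features + "MedHouseVal" target

# Pearson captures linear relationships, Spearman captures monotonic ones
pearson = df.corr(method="pearson")["MedHouseVal"].drop("MedHouseVal")
spearman = df.corr(method="spearman")["MedHouseVal"].drop("MedHouseVal")

# Keep the five features with the strongest absolute Pearson correlation
print(pearson.abs().sort_values(ascending=False).head(5))
print(spearman.abs().sort_values(ascending=False).head(5))
```

Comparing the two rankings is a cheap way to spot features whose relationship with the target is monotonic but not linear.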
2. Numerical Input, Categorical Output
- ANOVA (Analysis of Variance): Assesses whether the means of two or more groups (categorical target) are statistically different based on a continuous feature.
- F-Test (in the context of ANOVA): Used to compare variances across groups, helping to identify continuous features that discriminate between categories.
- Fisher Score: Evaluates how well individual numerical features can separate or discriminate between different categories or classes in the target variable. The higher the Fisher Score of a feature, the better it is considered at discriminating between the classes, making it a valuable feature for classification tasks.
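For example, scikit-learn's SelectKBest can score numerical features against a categorical target with the ANOVA F-test; in the sketch below the Iris data and k=2 are illustrative choices.

```python
# Minimal sketch: ANOVA F-test scores for numerical features vs. a categorical target.
# The dataset and k=2 are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)

print("F-scores:", selector.scores_.round(1))
print("Selected feature indices:", selector.get_support(indices=True))
```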
3. Categorical Input, Numerical Output
- Point-Biserial Correlation: Used when one variable is continuous and the other is binary (a special case of a categorical variable).
- ANOVA (Analysis of Variance): While typically used for a numerical input and categorical output, ANOVA can also be adapted to assess the impact of a categorical input on a continuous outcome by treating the categorical variable as a factor in a regression model (ANOVA regression).
4. Categorical Input, Categorical Output
- Chi-Square Test: Evaluates the independence between two categorical variables.
- Mutual Information: Although applicable to various data types, mutual information is particularly useful for capturing the dependency between categorical variables.
- Information Gain: Evaluates how much information about the target variable's outcome is gained by knowing the value of the input feature. It's a key metric in decision tree algorithms such as ID3 and C4.5 (CART typically uses Gini impurity instead), where it helps in selecting the features that best split the data into groups with homogeneous (purer) target variable classes. While Information Gain is most commonly associated with categorical data, it can also be adapted to the ‘Numerical Input, Categorical Output’ and ‘Categorical Input, Numerical Output’ scenarios.
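As a small, self-contained example of this scenario (the toy DataFrame below is made up purely for illustration), categorical inputs can be one-hot encoded and scored against a categorical target with the chi-square test and mutual information:

```python
# Minimal sketch: chi-square and mutual information for categorical input/output.
# The toy DataFrame is an illustrative assumption, not real data.
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif

df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "red"],
    "size":   ["S",   "M",    "L",   "S",     "M",    "L"],
    "label":  [1, 0, 1, 0, 0, 1],
})

# One-hot encode the categories (chi-square requires non-negative values)
X = pd.get_dummies(df[["colour", "size"]]).astype(int)
y = df["label"]

chi2_scores, p_values = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

print(pd.DataFrame(
    {"chi2": chi2_scores, "p_value": p_values, "mutual_info": mi_scores},
    index=X.columns,
))
```

With only a handful of rows the scores are meaningless; the point is simply the mechanics of encoding and scoring.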
It's worth noting that some measures, like Mutual Information, Information Gain, Maximal Information Coefficient and ANOVA, can be flexible and applied across different types of data by adapting their application context. Also, Relief-based algorithms can work across different types of data but are more focused on the feature selection process rather than being a direct measure of association between specific types of input and output.
- Relief-Based Algorithms: These algorithms, including Relief, ReliefF, and RReliefF, are useful for both regression and classification problems. They work by iteratively selecting features that distinguish between instances that are near to each other but belong to different classes.
- Maximal Information Coefficient (MIC): Part of the MINE (Maximal Information-based Nonparametric Exploration) family of statistics, designed to capture a wide range of relationships, including but not limited to linear, exponential, and periodic associations.
The main advantages of filter methods are their simplicity, efficiency, and the fact that they do not require the training of a model. At the same time, filter methods evaluate each feature in isolation, which means they might miss out on important interactions between features that could be relevant for prediction.
Wrapper Methods
Wrapper methods for feature selection use a predictive model to evaluate combinations of features and determine how effectively each combination improves model performance. Because wrapper methods assess the performance of a selected feature subset within the context of a specific model, they are more computationally intensive than filter methods but often produce better feature subsets tailored to that model's performance.
The core idea of wrapper methods is to "wrap" a machine learning model with a search algorithm that iterates over various combinations of features, assessing each combination's performance through the model. The search process aims to find the optimal subset of features according to a predefined evaluation criterion, such as accuracy for classification models or mean squared error for regression models.
- Forward Selection: This iterative method starts with an empty set of features and adds one feature at a time. At each step, it evaluates all remaining features and adds the one that most improves the model's performance. The process continues until no further improvement is observed.
- Backward Elimination: In contrast to forward selection, backward elimination starts with the full set of features and iteratively removes the least significant feature—the one whose removal causes the least performance degradation. This process is repeated until removing any more features results in a significant performance drop.
- Recursive Feature Elimination (RFE): RFE combines aspects of both forward selection and backward elimination. It starts with the full set of features and trains the model. It then uses the model (such as a linear model or a decision tree) to estimate the importance of each feature. The least important feature(s) are removed, and the model is retrained on the reduced set of features. This process is repeated until the desired number of features is reached or no further improvement can be achieved.
- Exhaustive Feature Selection: This method evaluates all possible combinations of features within a given subset-size range, assessing each combination's performance. Due to the combinatorial explosion of feature subsets, this method is often impractical for datasets with a large number of features but can be very effective for smaller feature sets.
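As one possible illustration of a wrapper method, forward selection can be run with scikit-learn's SequentialFeatureSelector, which wraps an estimator and grows the feature set one feature at a time; the dataset, estimator, and target of three features below are illustrative assumptions.

```python
# Minimal sketch: forward selection with a wrapped logistic regression model.
# The dataset, estimator, and n_features_to_select=3 are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale inside the wrapped model so each candidate subset is evaluated fairly
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

sfs = SequentialFeatureSelector(model, n_features_to_select=3,
                                direction="forward", cv=5)
sfs.fit(X, y)

print("Selected feature indices:", sfs.get_support(indices=True))
```

Switching direction="forward" to "backward" turns the same sketch into backward elimination.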
Wrapper methods tailor the feature selection to the specific model, potentially leading to better model performance. They can capture interactions between features that may be missed by filter methods.
On the other hand, the need to train models for numerous feature combinations makes wrapper methods computationally expensive, especially with large datasets and feature sets. There's a higher risk of overfitting the training data, as the feature selection process is heavily influenced by model performance on a specific dataset.
Embedded Methods
Embedded methods for feature selection integrate the feature selection process within the model training algorithm itself. Unlike filter and wrapper methods, which are separate from the model training process, embedded methods use the intrinsic properties of the machine learning algorithms to perform feature selection. This approach can offer a balance between the computational efficiency of filter methods and the model-specific insights of wrapper methods.
- Lasso Regression (L1 Regularization): Lasso (Least Absolute Shrinkage and Selection Operator) regression includes a penalty term equal to the absolute value of the magnitude of coefficients. This regularization can shrink some of the coefficients to zero, effectively eliminating those features from the model. Lasso regression is particularly useful for models where we expect many features to be irrelevant.
- Ridge Regression (L2 Regularization): While Ridge regression is more about shrinking coefficients to prevent overfitting and doesn't necessarily set coefficients to zero, it can still indicate feature importance based on the size of the coefficients.
- Elastic Net: A combination of L1 and L2 regularization, Elastic Net can both select features (like Lasso) and stabilize the model (like Ridge) when features are correlated.
- Decision Trees and Ensemble Models (Random Forest, Gradient Boosting): These algorithms inherently perform feature selection by choosing the most informative features for splitting the data at each node. The importance of a feature can be gauged by how frequently it is used to split data across all trees and how much it contributes to reducing variance (for regression) or increasing purity (for classification).
- Regularization in Deep Learning: Neural networks can also incorporate regularization techniques that penalize complex models and effectively perform feature selection by reducing the weights associated with less important inputs to minimal values.
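As a minimal sketch of embedded selection with L1 regularisation (the diabetes dataset and alpha value below are illustrative assumptions), features whose Lasso coefficients shrink to zero can be dropped with scikit-learn's SelectFromModel:

```python
# Minimal sketch: Lasso (L1) shrinks some coefficients to exactly zero;
# SelectFromModel then keeps only the surviving features.
# The dataset and alpha=0.1 are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # Lasso is sensitive to feature scale

lasso = Lasso(alpha=0.1).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)

print("Coefficients:        ", lasso.coef_.round(2))
print("Kept feature indices:", np.flatnonzero(selector.get_support()))
```

In practice alpha would be tuned (for example with cross-validation), since it directly controls how many coefficients are driven to zero.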
The key advantage of embedded methods is that they consider the interaction with the model's performance, similar to wrapper methods, but with lower computational overhead. However, a potential limitation is that, particularly in complex models like deep learning networks, the feature selection process may become less transparent and harder to interpret. This makes it challenging to understand why specific features were selected or discarded.
Unsupervised Methods
Unsupervised feature selection does not use a target variable and instead focuses on the intrinsic structure and relationships within the data.
- Variance Thresholding: Removes features whose variance doesn't meet a certain threshold.
- Clustering-Based Methods: Group features into clusters based on similarity and select representative features from each cluster.
- Entropy-Based Methods: Use entropy to measure the amount of information or uncertainty in a feature.
- Principal Feature Analysis: Identifies and retains features that capture the most significant structure or relationships in the data without relying on a target variable.
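For instance, variance thresholding is a one-liner in scikit-learn; in the sketch below the synthetic data and the threshold of 0.1 are illustrative assumptions.

```python
# Minimal sketch: drop features whose variance falls below a threshold.
# The synthetic data and threshold=0.1 are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),              # informative spread
    np.full(100, 3.0),                 # constant feature (zero variance)
    rng.normal(scale=0.05, size=100),  # near-constant feature
])

selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print("Kept feature indices:", selector.get_support(indices=True))
print("Shape after selection:", X_reduced.shape)
```

Because variance depends on scale, the threshold only makes sense once you know how the features are measured or scaled.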
Dimensionality Reduction Techniques
Dimensionality reduction techniques are primarily used to reduce the number of input variables in a dataset by transforming the original features into a smaller set of new features, capturing most of the essential information with less data. While these techniques are often more about feature extraction rather than feature selection in the strict sense, some can indirectly contribute to feature selection by identifying the most informative features or combinations thereof. Here are some commonly used dimensionality reduction techniques that can be related to feature selection:
- Principal Component Analysis (PCA): Transforms the original features into a smaller set of uncorrelated principal components based on variance.
- Linear Discriminant Analysis (LDA): Finds the linear combinations of features that best separate two or more classes (supervised, but often listed under dimensionality reduction).
- t-Distributed Stochastic Neighbor Embedding (t-SNE): Reduces high-dimensional data to a lower-dimensional space, preserving the similarity between instances.
- Autoencoders (in Deep Learning): Neural networks designed to learn a compressed representation of the input data, effectively reducing dimensionality.
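As an illustration, PCA in scikit-learn can keep just enough components to explain a chosen share of the variance; the dataset and the 95% threshold below are illustrative assumptions.

```python
# Minimal sketch: standardise the features, then keep enough principal
# components to explain 95% of the variance.
# The dataset and the 0.95 threshold are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)  # a float in (0, 1) means "explain this much variance"
X_pca = pca.fit_transform(X_scaled)

print("Original features:", X.shape[1])
print("Components kept:  ", X_pca.shape[1])
```

Keep in mind that the resulting components are combinations of the original features, so interpretability suffers compared with feature selection proper.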
Which Method to Choose?
The choice of feature selection method depends on various factors, including the nature of the data, the type of model being used, and computational considerations.
- Filter methods are fast and effective for a first-pass screening. They are often used as a preliminary step in feature selection to quickly reduce the number of features to a more manageable size before applying more complex selection methods or building models.
- Wrapper methods, while potentially offering better performance, are computationally expensive and may not be feasible for very large feature sets.
- Embedded methods offer a good balance by incorporating feature selection into the model training process but are limited to specific models.
In practice, a combination of these methods might be used in sequence.
Unsupervised feature selection methods are often used in exploratory data analysis, clustering, or when preparing data for unsupervised learning algorithms.
Practical Tips for Effective Feature Selection
Here are some practical tips to guide you through the feature selection process.
- Leverage domain expertise to identify potentially relevant and irrelevant features.
- Conduct thorough exploratory data analysis (EDA) to understand the distributions, relationships, and peculiarities within your data.
- Begin with a simple model using all features to establish a performance baseline, which helps in evaluating the impact of feature selection.
- Apply a combination of filter, wrapper, and embedded methods to get a comprehensive view of feature importance from different angles.
- Use cross-validation to assess feature subsets' performance, ensuring robustness and generalisability (see the pipeline sketch after this list).
- Be mindful of overfitting: use regularisation techniques (like L1 and L2 regularisation) to penalise complex models and reduce overfitting.
- Consider feature interactions:
- Non-linearity: consider interaction terms or non-linear models if appropriate.
- Correlation analysis: be cautious of highly correlated features.
- Prioritise interpretability: a model with fewer features is often easier to interpret and explain. Balance model complexity and performance with the need for interpretability.
- Utilise automated feature selection libraries and tools, but understand the methods they employ and the assumptions they make.
- Treat feature selection as an iterative process.
- If possible, validate your model's performance, including the selected features, on an external dataset to ensure that your findings are not specific to a particular dataset.
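The sketch below ties several of these tips together: feature selection is placed inside a scikit-learn Pipeline so that, during cross-validation, features are chosen only from each training fold and never from the held-out fold. The dataset, k=10, and the model are illustrative assumptions.

```python
# Minimal sketch: keep feature selection inside a Pipeline so cross-validation
# does not leak information from the validation folds.
# The dataset, k=10, and model choice are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy with 10 selected features: %.3f" % scores.mean())
```

Running selection before splitting the data is a common source of optimistic, non-reproducible results; the pipeline structure prevents it by construction.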
A Step-by-Step Guide to the Feature Selection Process
Feature selection is an iterative and somewhat subjective process. It's crucial to balance statistical methods with domain knowledge and practical considerations to achieve the best outcome for your specific problem. Here's a simple step-by-step guide to navigate the process:
Step 1: Understand Your Problem and Data
- Clearly define your machine learning problem and objectives.
- Conduct exploratory data analysis (EDA) to familiarise yourself with the dataset's features, distributions, and potential relationships.
Step 2: Preprocess Your Data
- Clean the data (handle missing values, duplicates, and obvious errors).
- Convert categorical variables to a format suitable for machine learning models (e.g., one-hot encoding).
- Normalisation/standardisation: scale your features to ensure that no variable dominates due to its scale.
Step 3: Establish a Baseline Model
Build a basic model using all available features to establish a performance baseline, which will help you gauge the effectiveness of your feature selection.
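For example, a baseline might look like the sketch below (the dataset and model are illustrative assumptions); the cross-validated score with all features becomes the reference that later feature subsets need to match or beat.

```python
# Minimal sketch: establish a baseline score using all available features.
# The dataset and model are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

baseline_model = RandomForestClassifier(n_estimators=200, random_state=0)
baseline_scores = cross_val_score(baseline_model, X, y, cv=5)

print("Baseline accuracy with all %d features: %.3f"
      % (X.shape[1], baseline_scores.mean()))
```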
Step 4: Apply Filter Methods
- Use filter methods to rank features based on their relevance to the target variable.
- Choose a subset of features based on statistical significance, relevance scores, or predefined thresholds.
Step 5: Explore Wrapper Methods
- Employ wrapper methods to evaluate different feature subsets.
- Select the feature subset that offers the best improvement in model performance compared to the baseline.
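One way to carry out this step is recursive feature elimination with cross-validation (RFECV), which searches for the subset size that maximises cross-validated performance; the dataset and estimator below are illustrative assumptions.

```python
# Minimal sketch: RFECV removes features one at a time and uses
# cross-validation to pick the best-performing subset size.
# The dataset and estimator are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# RFECV needs an estimator exposing coef_ or feature_importances_
rfecv = RFECV(estimator=LogisticRegression(max_iter=5000), step=1, cv=5)
rfecv.fit(X_scaled, y)

print("Optimal number of features:", rfecv.n_features_)
print("Selected feature indices:  ", rfecv.get_support(indices=True))
```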
Step 6: Consider Embedded Methods
- Utilise embedded methods if you're working with algorithms that inherently perform feature selection (e.g., Lasso regression, Decision Trees).
- Analyse the importance scores provided by these models to identify key features.
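As a sketch of this step (the dataset and number of trees are illustrative assumptions), a tree ensemble exposes importance scores that can be ranked directly:

```python
# Minimal sketch: rank features by the importance scores a random forest
# computes during training. The dataset and n_estimators are illustrative.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(data.data, data.target)

importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(10))
```

Impurity-based importances can favour high-cardinality features, so it is worth cross-checking the ranking against another method before discarding features.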
Step 7: Iterate and Refine
- Iteratively refine your feature subset, re-evaluating model performance and adjusting your feature set as needed.
- Use cross-validation throughout the process to ensure that your feature selection generalises well to unseen data.
Step 8: Validate Your Model
- Test your final model, with the selected features, on a holdout set to assess its performance on unseen data.
- If possible, validate your findings on an external dataset to confirm the robustness of your feature selection.
Step 9: Document and Review
Keep a record of the feature selection methods used, features selected, and the rationale behind these choices. Have your approach reviewed by peers or domain experts to ensure its validity and robustness.
Step 10: Deployment and Monitoring
Continuously monitor the model's performance, as changes in data over time might necessitate revisiting feature selection.
Final Thoughts
Feature selection is both an art and a science, and the best approach often depends on the specifics of your dataset and the problem you are trying to solve. Don't miss out — start mastering the art and science of feature selection today!
References
- Andriy Burkov, The Hundred-Page Machine Learning Book (Great Britain: Amazon, 2019)
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Second Edition (Canada: O’Reilly Media, Inc., 2019)
- Jason Brownlee, Data Preparation for Machine Learning: Data Cleaning, Feature Selection, and Data Transforms in Python (Machine Learning Mastery, 2020)
- Max Kuhn, Kjell Johnson, Applied Predictive Modeling (New York: Springer, 2013)
- Max Kuhn, Kjell Johnson, Feature Engineering and Selection: A Practical Approach for Predictive Models (Boca Raton, FL: Chapman & Hall/CRC Press, 2019)