A comprehensive benchmark study of feature selection techniques for supervised machine learning predictive models on tabular data.
This repository implements and evaluates 20 feature selection methods from five categories (filter, wrapper, embedded, hybrid and advanced) on both synthetic and real-world datasets. The study offers practical insight into how well each technique performs across different scenarios. A pipeline for generating synthetic datasets with varied characteristics and complex feature relationships is also provided. The data itself is not included in the repository. Synthetic datasets can be generated by running synthetic_data_generation/main.py, and the real-world datasets used in the study can be downloaded from the sources listed in Appendix 8.2 of the miguel_moral_tfg.pdf file.
- Implementation of 20 feature selection methods
- Synthetic dataset generation with controlled relationships
- Real-world dataset preprocessing
- Different evaluation frameworks for synthetic and real-world datasets
- Comprehensive benchmarking framework
- Visualization tools for results analysis
git clone https://github.com/miguelmoralh/feature_selection_benchmark.git
cd feature_selection_benchmark
pip install -r requirements.txt
- Generate synthetic datasets:
python synthetic_data_generation/main.py
- Process real-world datasets:
python generate_real_world_metadata.py
- Run benchmarks:
python main.py
- Generate results and visualizations:
python generate_results.py
python generate_plots.py
Implementation of various feature selection techniques:
- Bivariate
  - information_value.py: Implements Weight of Evidence and Information Value based selection
  - correlation.py: Uses correlation coefficients for feature selection
  - norm_mutual_info.py: Implements Normalized Mutual Information selection
  - chi_squared.py: Chi-squared statistical test based selection
- Multivariate
  - fcbf.py: Fast Correlation-Based Filter selection
  - mrmr.py: Minimum Redundancy Maximum Relevance algorithm selection
  - relief_algorithms.py: Relief family algorithms selection
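To illustrate how a bivariate filter such as Information Value scores each feature against the target independently, here is a minimal sketch. It is not the code in information_value.py: the function names, the additive smoothing constant `eps` and the 0.02 cut-off (a common rule of thumb) are assumptions made for the example, and features are assumed to be categorical or already binned.

```python
import math
from collections import Counter


def information_value(feature, target, eps=0.5):
    """Information Value of a categorical feature against a binary 0/1 target.

    eps is an assumed additive smoothing term that avoids log(0) when a
    category contains only events or only non-events.
    """
    events = Counter(x for x, y in zip(feature, target) if y == 1)
    non_events = Counter(x for x, y in zip(feature, target) if y == 0)
    categories = set(events) | set(non_events)

    total_events = sum(events.values()) + eps * len(categories)
    total_non_events = sum(non_events.values()) + eps * len(categories)

    iv = 0.0
    for cat in categories:
        p_event = (events[cat] + eps) / total_events
        p_non_event = (non_events[cat] + eps) / total_non_events
        woe = math.log(p_non_event / p_event)  # Weight of Evidence of this category
        iv += (p_non_event - p_event) * woe
    return iv


def select_by_iv(columns, target, threshold=0.02):
    """Keep features whose IV reaches the (assumed) threshold."""
    scores = {name: information_value(values, target) for name, values in columns.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return kept, scores


# Toy data: "informative" tracks the target, "noise" is independent of it.
target = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "informative": ["a", "a", "a", "b", "b", "b", "b", "a"],
    "noise": ["x", "y", "x", "y", "x", "y", "x", "y"],
}
kept, scores = select_by_iv(columns, target)
print(kept, scores)
```

Because each feature is scored on its own, a filter like this is cheap but cannot detect redundancy between features, which is what the multivariate filters address.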
- Importance
  - rf_feature_importances.py: Random Forest importance-based selection
  - cb_feature_importances.py: CatBoost importance-based selection
  - permutation_feature_importance.py: Permutation importance selection implementation
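As a rough illustration of the idea behind permutation importance, the sketch below shuffles one feature at a time and measures how much a model's accuracy drops. It is not the repository's implementation: `toy_model` is a hypothetical stand-in for a fitted model, and the function names, the number of repeats and the zero threshold are assumptions. In practice the score is usually computed on held-out data.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy setup: the label depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def toy_model(row):
    """Hypothetical fitted model that has learned the true rule."""
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, X, y)
selected = [j for j, imp in enumerate(importances) if imp > 0.0]
print(importances, selected)
```

Shuffling the noise feature leaves every prediction unchanged, so its importance is exactly zero and only feature 0 is kept.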
- Backward Elimination
  - sequential_backward_selection.py: Sequential Backward Selection algorithm
- Forward Selection
  - sequential_forward_selection.py: Sequential Forward Selection algorithm
- Bidirectional
  - sequential_forward_floating_selection.py: Sequential Forward Floating Selection implementation
  - sequential_backward_floating_selection.py: Sequential Backward Floating Selection implementation
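The greedy search that the sequential wrappers listed above build on can be sketched as follows, using forward selection as the example. This is a generic illustration, not the repository's code: `toy_score` is a hypothetical stand-in for a cross-validated model score, and the names and the `min_gain` stopping rule are assumptions.

```python
def sequential_forward_selection(features, score_fn, min_gain=1e-9):
    """Greedily add the feature that improves score_fn the most.

    score_fn takes a list of feature names and returns a score to maximize.
    The search stops when no remaining feature improves the score by more
    than min_gain.
    """
    selected = []
    best_score = score_fn(selected)
    remaining = list(features)
    while remaining:
        candidate_scores = {f: score_fn(selected + [f]) for f in remaining}
        best_feature = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_feature] - best_score <= min_gain:
            break  # no candidate helps enough, stop searching
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = candidate_scores[best_feature]
    return selected, best_score


# Hypothetical score: each useful feature adds a fixed gain, and every
# feature costs a small penalty, so adding noise lowers the score.
GAINS = {"x1": 0.30, "x2": 0.15}


def toy_score(subset):
    return 0.5 + sum(GAINS.get(f, 0.0) for f in subset) - 0.01 * len(subset)


selected, best_score = sequential_forward_selection(["noise", "x1", "x2"], toy_score)
print(selected, best_score)
```

Backward elimination runs the same loop in reverse, starting from all features and removing one per step. Each step retrains the model once per candidate feature, which is why wrappers are far more expensive than filters.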
- boruta.py: Boruta algorithm implementation
- shap.py: SHAP-based feature selection
- Advanced-Wrapper
  - shap_sfs.py: SHAP combined with Sequential Forward Selection
- Embedded-Wrapper
  - recursive_feature_elimination.py: Recursive Feature Elimination selection implementation
- Filter-Wrapper
  - nmi_sfs.py: Mutual Information with Sequential Forward Selection
  - fcbf_sfs.py: FCBF with Sequential Forward Selection
- Config
  - dataset_config.py: Configuration for synthetic dataset generation
  - interactions.py: Defines feature interaction types
  - transforms.py: Implements feature transformations
- base_random_generator.py: Base feature generation functionality
- feature_importances.py: Feature importance calculation
- fs_configs.py: Feature selection configurations
- main.py: Main synthetic data generation script
- utils.py: Utility functions for data generation

- benchmark_loop.py: Main benchmarking implementation
- constants.py: Project-wide constants
- execution_functions.py: Feature selection execution functions used in benchmark_loop.py
- generate_plots.py: Results visualization
- generate_real_world_metadata.py: Real-world dataset preprocessing
- generate_results.py: Results compilation and analysis
- main.py: Main execution script
- params_config.py: Model parameters configuration

- utils_datasets.py: Dataset loading and processing utilities
- utils_methods.py: Common method utilities
- utils_preprocessing.py: Data preprocessing functions
- utils_results_and_plots.py: Results processing and visualization utilities
The benchmark results are stored in the logs directory:
- logs/benchmark/: Raw benchmark results
- logs/results/: Processed results and analysis
- logs/plots/: Generated visualizations
- 'feature_selection_benchmark.pdf': The written paper of the study
If you use this work in your research, please cite:
@article{moral2025benchmark,
title={Benchmark of feature selection techniques for tabular data},
author={Moral, Miguel},
journal={Universitat Autònoma de Barcelona},
year={2025}
}
Miguel Moral - miguel.moral@autonoma.cat
Project Link: https://github.com/miguelmoralh/feature_selection_benchmark