Comprehensive benchmark study of feature selection techniques for predictive machine learning models on tabular data. Various feature selection methods are evaluated across different data characteristics and predictive scenarios.

miguelmoralh/feature-selection-benchmark

Feature Selection Benchmark

A comprehensive benchmark study of feature selection techniques for supervised machine learning predictive models on tabular data.

Overview

This repository implements and evaluates 20 feature selection methods across five categories (filter, wrapper, embedded, hybrid, and advanced) using both synthetic and real-world datasets. The study offers practical insights into how effective each technique is across various scenarios. A pipeline for generating synthetic datasets with various characteristics and complex relationships is also provided. The data itself is not included in the repository: synthetic datasets can be generated by running synthetic_data_generation/main.py, and the real-world datasets used are available for download through Appendix 8.2 of the miguel_moral_tfg.pdf file.

Key Features

  • Implementation of 20 feature selection methods
  • Synthetic dataset generation with controlled relationships
  • Real-world dataset preprocessing
  • Different evaluation frameworks for synthetic and real-world datasets
  • Comprehensive benchmarking framework
  • Visualization tools for results analysis

Installation

git clone https://github.com/miguelmoralh/feature_selection_benchmark.git
cd feature_selection_benchmark
pip install -r requirements.txt

Usage

  1. Generate synthetic datasets:
python synthetic_data_generation/main.py
  2. Process real-world datasets:
python generate_real_world_metadata.py
  3. Run benchmarks:
python main.py
  4. Generate results and visualizations:
python generate_results.py
python generate_plots.py
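
The steps above could also be chained from a single script. The following is a minimal sketch, not part of the repository: the names PIPELINE_STEPS, build_commands, and run_pipeline are hypothetical, and it assumes the scripts are run from the repository root exactly as listed above.

```python
import subprocess
import sys

# Script paths copied from the usage steps above, in execution order.
PIPELINE_STEPS = [
    "synthetic_data_generation/main.py",
    "generate_real_world_metadata.py",
    "main.py",
    "generate_results.py",
    "generate_plots.py",
]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, script] for script in PIPELINE_STEPS]


def run_pipeline():
    """Run every step in order; stop at the first step that fails."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on top of a failed earlier step.
        subprocess.run(cmd, check=True)
```

Calling run_pipeline() from the repository root would then execute the five scripts one after another.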

Repository Structure

feature_selection_methods/

Implementation of various feature selection techniques:

Filter Methods

  • Bivariate

    • information_value.py: Implements Weight of Evidence and Information Value based selection
    • correlation.py: Uses correlation coefficients for feature selection
    • norm_mutual_info.py: Implements Normalized Mutual Information selection
    • chi_squared.py: Chi-squared statistical test based selection
  • Multivariate

    • fcbf.py: Fast Correlation-Based Filter (FCBF) selection
    • mrmr.py: Minimum Redundancy Maximum Relevance (mRMR) selection
    • relief_algorithms.py: Selection with the Relief family of algorithms
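
To illustrate how a bivariate filter scores each feature against the target independently, here is a minimal pure-Python sketch of selection by normalized mutual information. It is not the code in norm_mutual_info.py: it assumes discrete (or pre-binned) features, uses the geometric-mean normalization I(X;Y) / sqrt(H(X) H(Y)), and the function names and the threshold parameter are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def normalized_mutual_info(x, y):
    """MI divided by sqrt(H(x) * H(y)); 0.0 if either variable is constant."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0
    return mutual_information(x, y) / sqrt(hx * hy)


def select_by_nmi(features, target, threshold=0.1):
    """Return feature names with NMI >= threshold, highest score first.

    `features` maps a feature name to its list of values.
    """
    scores = {
        name: normalized_mutual_info(column, target)
        for name, column in features.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]
```

Because each feature is scored on its own, a filter like this is cheap but cannot detect redundancy between features, which is the gap the multivariate filters listed above aim to close.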

Embedded Methods

  • Importance
    • rf_feature_importances.py: Random Forest importance-based selection
    • cb_feature_importances.py: CatBoost importance-based selection
    • permutation_feature_importance.py: Permutation importance-based selection
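
As a rough illustration of the permutation importance idea, the sketch below shuffles one feature column at a time and records how much a fitted model's accuracy drops. It is an assumption-laden toy, not the repository's implementation: the model is any hypothetical `predict(row)` callable, rows are plain lists, and accuracy stands in for whatever metric the benchmark uses.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled.

    `predict` maps one row (a list of feature values) to a label.
    """
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other column intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = accuracy(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return importances
```

Features whose mean drop is at or near zero are candidates for removal; the cut-off to apply is a design choice this sketch leaves open.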

Wrapper Methods

  • Backward Elimination
    • sequential_backward_selection.py: Sequential Backward Selection algorithm
  • Forward Selection
    • sequential_forward_selection.py: Sequential Forward Selection algorithm
  • Bidirectional
    • sequential_forward_floating_selection.py: Sequential Forward Floating Selection implementation
    • sequential_backward_floating_selection.py: Sequential Backward Floating Selection implementation
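
The wrapper idea can be sketched with Sequential Forward Selection: start from an empty subset and greedily add the feature that most improves a score until no addition helps. This is a generic sketch rather than the code in sequential_forward_selection.py; `score_fn` and `min_gain` are hypothetical names, and in practice `score_fn` would typically be a cross-validated model score for the given subset.

```python
def sequential_forward_selection(feature_names, score_fn, min_gain=1e-9):
    """Greedy forward selection.

    `score_fn(subset)` returns a score (higher is better) for a list of
    feature names. Stops when the best addition improves the score by
    no more than `min_gain`.
    """
    selected = []
    best_score = score_fn(selected)
    remaining = list(feature_names)
    while remaining:
        # Score every one-feature extension of the current subset.
        candidate_scores = {f: score_fn(selected + [f]) for f in remaining}
        best_feature = max(candidate_scores, key=candidate_scores.get)
        gain = candidate_scores[best_feature] - best_score
        if gain <= min_gain:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = candidate_scores[best_feature]
    return selected, best_score
```

Backward elimination mirrors this loop by starting from all features and removing one at a time, while the floating variants add a conditional step in the opposite direction after each move.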

Advanced Methods

  • boruta.py: Boruta algorithm implementation
  • shap.py: SHAP-based feature selection

Hybrid Methods

  • Advanced-Wrapper
    • shap_sfs.py: SHAP combined with Sequential Forward Selection
  • Embedded-Wrapper
    • recursive_feature_elimination.py: Recursive Feature Elimination (RFE) selection
  • Filter-Wrapper
    • nmi_sfs.py: Normalized Mutual Information combined with Sequential Forward Selection
    • fcbf_sfs.py: FCBF with Sequential Forward Selection
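
One common way to combine a filter with a wrapper is to let the cheap filter shortlist features and run the expensive wrapper search only on that shortlist. The sketch below shows that pattern under stated assumptions: the repository's hybrid methods may combine the two stages differently, and `filter_scores`, `score_fn`, and `top_k` are hypothetical names introduced for illustration.

```python
def filter_then_forward_select(filter_scores, score_fn, top_k=3):
    """Two-stage hybrid selection.

    `filter_scores` maps feature name -> precomputed filter relevance.
    `score_fn(subset)` returns a wrapper score (higher is better).
    """
    # Stage 1 (filter): keep only the top_k features by relevance score.
    shortlist = sorted(filter_scores, key=filter_scores.get, reverse=True)[:top_k]

    # Stage 2 (wrapper): greedy forward selection restricted to the shortlist.
    selected, best_score = [], score_fn([])
    while True:
        candidates = [f for f in shortlist if f not in selected]
        if not candidates:
            break
        scored = {f: score_fn(selected + [f]) for f in candidates}
        best = max(scored, key=scored.get)
        if scored[best] <= best_score:
            break  # no candidate strictly improves the score
        selected.append(best)
        best_score = scored[best]
    return selected
```

The trade-off is that the wrapper evaluates far fewer subsets, but any feature the filter discards is never reconsidered, even if it would have helped the model.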

synthetic_data_generator/

  • Config
    • dataset_config.py: Configuration for synthetic dataset generation
    • interactions.py: Defines feature interaction types
    • transforms.py: Implements feature transformations
  • base_random_generator.py: Base feature generation functionality
  • feature_importances.py: Feature importance calculation
  • fs_configs.py: Feature selection configurations
  • main.py: Main synthetic data generation script
  • utils.py: Utility functions for data generation
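
To give a feel for what a synthetic dataset with controlled relationships looks like, here is a small standalone sketch. It does not reproduce the generator in this directory: the function names, the specific transform (sine) and interaction (product), and the precision/recall check are all assumptions chosen for illustration. The point is that the relevant features are known by construction, so a selector's output can be compared against ground truth.

```python
import math
import random


def make_synthetic_dataset(n_rows=200, n_noise=3, seed=0):
    """Binary-target dataset with 3 relevant features plus pure-noise features."""
    rng = random.Random(seed)
    rows, target = [], []
    for _ in range(n_rows):
        x1, x2, x3 = (rng.gauss(0, 1) for _ in range(3))
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        # Nonlinear transform of x1 plus an interaction between x2 and x3.
        signal = math.sin(x1) + x2 * x3
        # Small label noise so the target is not a deterministic function.
        target.append(1 if signal + rng.gauss(0, 0.1) > 0 else 0)
        rows.append([x1, x2, x3] + noise)
    names = ["x1", "x2", "x3"] + [f"noise_{i}" for i in range(n_noise)]
    relevant = ["x1", "x2", "x3"]
    return names, rows, target, relevant


def selection_precision_recall(selected, relevant):
    """Compare a selected feature set against the known relevant set."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a selector that returns ["x1", "noise_0"] on this dataset would get a precision of 0.5 and a recall of one third.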

Core Scripts

  • benchmark_loop.py: Main benchmarking implementation
  • constants.py: Project-wide constants
  • execution_functions.py: Feature selection execution functions used in benchmark_loop.py
  • generate_plots.py: Results visualization
  • generate_real_world_metadata.py: Real-world dataset preprocessing
  • generate_results.py: Results compilation and analysis
  • main.py: Main execution script
  • params_config.py: Model parameters configuration

utils/

  • utils_datasets.py: Dataset loading and processing utilities
  • utils_methods.py: Common method utilities
  • utils_preprocessing.py: Data preprocessing functions
  • utils_results_and_plots.py: Results processing and visualization utilities

Results

The benchmark results are stored in the logs directory:

  • logs/benchmark/: Raw benchmark results
  • logs/results/: Processed results and analysis
  • logs/plots/: Generated visualizations

Paper

  • feature_selection_benchmark.pdf: The written paper of the study

Citation

If you use this work in your research, please cite:

@article{moral2025benchmark,
 title={Benchmark of feature selection techniques for tabular data},
 author={Moral, Miguel},
 journal={Universitat Autònoma de Barcelona},
 year={2025}
}

Contact

Miguel Moral - miguel.moral@autonoma.cat

Project Link: https://github.com/miguelmoralh/feature_selection_benchmark
