Spatial Semantic Segmentation of Sound Scenes

This is a baseline implementation for the DCASE2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes.

The DCASE2025 Challenge website provides an overview of the challenge tasks.

Description

Systems

The system consists of two models, audio tagging (AT) and source separation (SS), which are trained separately. The AT model consists of a pre-trained feature extractor backbone (M2D) and a head layer. For SS, we provide two variants: ResUNet and ResUNetK.
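
As a rough illustration of how the two models interact at inference time, the sketch below first runs the AT model to predict which classes are active and then queries the SS model once per detected class. This is a minimal sketch only: the function names and the way the class label is passed to the separator are assumptions, not the actual API in src/.

# Minimal sketch of the two-stage pipeline (hypothetical interfaces, not the repo API).
def run_pipeline(mixture, at_model, ss_model, class_names, threshold=0.5):
    # 1) Audio tagging: per-class probabilities from the M2D backbone + head.
    probs = at_model(mixture)
    active = [c for c, p in zip(class_names, probs) if p >= threshold]
    # 2) Source separation: one estimated waveform per detected class
    #    (class conditioning of ResUNet/ResUNetK is assumed to be a simple argument here).
    return {c: ss_model(mixture, c) for c in active}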

Dataset and folder structure

The data consists of two parts: the Development dataset and the Evaluation dataset. The Development dataset consists of newly recorded sound events and room impulse responses for DCASE2025 Challenge Task 4, along with sound events, noise, and room impulse responses from other available datasets. The Evaluation dataset will be released at a later stage.

The structure of the data is as follows (the "data/dev_set" folder contains the Development dataset):

data
`-- dev_set
    |-- config
    |   |-- EARS_config.json
    |   `-- FSD50K_config.json
    |-- metadata
    |   |-- valid
    |   `-- valid.json
    |-- noise
    |   |-- train
    |   `-- valid
    |-- room_ir
    |   |-- train
    |   `-- valid
    |-- sound_event
    |   |-- train
    |   `-- valid
    `-- test
        |-- oracle_target
        `-- soundscape

The config, metadata, noise, room_ir, and sound_event folders are used for generating the training data, including the train and validation splits.
The test folder contains the test data for evaluating the model checkpoints, including the pre-mixed soundscapes in soundscape and the oracle target sources in oracle_target.
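
As a quick sanity check of this layout, the short sketch below counts audio files per split; it assumes WAV files and is only illustrative (the verify.py script described later performs the authoritative check).

from pathlib import Path

# Count .wav files under each split of the Development dataset (illustrative only).
root = Path("data/dev_set")
for sub in ["noise", "room_ir", "sound_event", "test"]:
    for split in sorted((root / sub).iterdir()):
        if split.is_dir():
            n = sum(1 for _ in split.rglob("*.wav"))
            print(f"{sub}/{split.name}: {n} wav files")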

The dataset "DCASE2025Task4Dataset: A Dataset for Spatial Semantic Segmentation of Sound Scenes" is available at https://zenodo.org/records/15117227.

Related Repositories

  • Part of src/models/resunet originates from https://github.com/bytedance/uss/tree/master/uss/models
  • Part of src/models/m2dat originates from https://github.com/nttcslab/m2d
  • Part of src/modules/spatialscaper2 originates from https://github.com/iranroman/SpatialScaper

Data Preparation and Environment Configuration

Setting

Clone repository

git clone https://github.com/nttcslab/dcase2025_task4_baseline.git
cd dcase2025_task4_baseline

Install environment

# Using conda
conda env create -f environment.yml
conda activate dcase2025t4

# Or using pip (python=3.11)
python -m venv dcase2025t4
source dcase2025t4/bin/activate
pip install -r requirements.txt

Install SpatialScaper

git clone https://github.com/iranroman/SpatialScaper.git
cd SpatialScaper
pip install -e .

SoX may be required for the above environment installation:

sudo apt-get update && sudo apt-get install -y gcc g++ sox libsox-dev

Data Preparation

The Development dataset can be downloaded and placed into the data folder as

# Download all files from https://zenodo.org/records/15117227 and unzip
wget -i dev_set_zenodo.txt
zip -s 0 DCASE2025Task4Dataset.zip --out unsplit.zip
unzip unsplit.zip

# Place the dev_set in the dcase2025_task4_baseline/data folder
ln -s "$(pwd)/final_0402_1/DCASE2025Task4Dataset/dev_set" /path/to/dcase2025_task4_baseline/data

In addition to the recorded data, sound events are also added from other datasets as

# Download Semantic Hearing's dataset
# https://github.com/vb000/SemanticHearing
wget -P data https://semantichearing.cs.washington.edu/BinauralCuratedDataset.tar

# Download EARS dataset using bash
# https://github.com/facebookresearch/ears_dataset
mkdir EARS
cd EARS
for X in $(seq -w 001 107); do
  curl -L https://github.com/facebookresearch/ears_dataset/releases/download/dataset/p${X}.zip -o p${X}.zip
  unzip p${X}.zip
  rm p${X}.zip
done

# Add data
cd dcase2025_task4_baseline
bash add_data.sh --semhear_path /path/to/BinauralCuratedDataset --ears_path /path/to/EARS

Verifying data folder structure

cd dcase2025_task4_baseline
python verify.py --source_dir .

Training

All TensorBoard logs and model checkpoints will be saved to the workspace folder.

Audio Tagging Model

Before training, the checkpoint of the M2D model should be downloaded as

cd dcase2025_task4_baseline
wget -P checkpoint https://github.com/nttcslab/m2d/releases/download/v0.3.0/m2d_as_vit_base-80x1001p16x16p32k-240413_AS-FT_enconly.zip
unzip checkpoint/m2d_as_vit_base-80x1001p16x16p32k-240413_AS-FT_enconly.zip -d checkpoint

The AT model is fine-tuned in two steps:

# Train only the head
python -m src.train -c config/label/m2dat_head.yaml -w workspace/label

# Continue fine-tuning the last blocks of the M2D backbone; replace BEST_EPOCH_NUMBER with the appropriate epoch number
python -m src.train -c config/label/m2dat_head_blks.yaml -w workspace/label -r workspace/label/m2dat_head/checkpoints/epoch=BEST_EPOCH_NUMBER.ckpt
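
BEST_EPOCH_NUMBER should be chosen from the validation metrics (e.g., in the TensorBoard logs under workspace/label). The sketch below simply lists the checkpoints that were written, assuming the epoch=*.ckpt naming used in the resume command above.

from pathlib import Path

# List saved head-training checkpoints so BEST_EPOCH_NUMBER can be picked
# (assumes the epoch=*.ckpt naming shown in the command above).
for ckpt in sorted(Path("workspace/label/m2dat_head/checkpoints").glob("epoch=*.ckpt")):
    print(ckpt.name)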

Separation Model

Two variants of the separation model, ResUNet and ResUNetK, are trained using:

# ResUNet
python -m src.train -c config/separation/resunet.yaml -w workspace/separation

# ResUNetK
python -m src.train -c config/separation/resunetk.yaml -w workspace/separation

Training hyperparameters

Some hyperparameters that affect training time and performance can be set in the YAML configuration files:

  • dataset_length in train_dataloader: Since each training mixture is generated randomly and independently, dataset_length can be set arbitrarily. A higher value increases the number of training steps per epoch and may slightly speed up training by reducing the frequency of validation.
  • batch_size in train_dataloader: When using N GPUs, the effective batch size becomes N * batch_size. We found that a larger batch size positively impacts audio tagging model training, but it also increases the training time.
  • num_workers in train_dataloader: Each training mixture loads and mixes 3 to 6 audio samples, which can be time-consuming. num_workers should be set based on the number of GPUs and CPU cores to optimize the dataloading process.
  • lr in optimizer: The learning rate should be adjusted based on the effective batch size, which changes with the number of GPUs or the batch_size in train_dataloader (see the sketch after this list).
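
One common heuristic is to scale the learning rate linearly with the effective batch size; this is an assumption for illustration, not a rule prescribed by the baseline configs, and the reference values below are placeholders.

# Hypothetical helper: linear learning-rate scaling with effective batch size.
def scaled_lr(base_lr, base_batch_size, batch_size_per_gpu, num_gpus):
    effective_batch_size = batch_size_per_gpu * num_gpus
    return base_lr * effective_batch_size / base_batch_size

# Example with placeholder reference values, not the baseline's actual settings.
print(scaled_lr(base_lr=1e-4, base_batch_size=64, batch_size_per_gpu=16, num_gpus=2))  # 5e-05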

Each baseline checkpoint was trained on 8 RTX 3090 GPUs in under 3 days. However, we found that with appropriate hyperparameter settings, similar results can be achieved using fewer GPUs in a comparable amount of time. Examples of configuration files for training with fewer GPUs can be found in config/variants/. The training times and results for these configurations are as follows:

AUDIO TAGGING

Configuration can be found at "config/variants/label".
| Config | CPU | GPU | m2dat_head (h) | m2dat_head_blks (h) | Total training time (h) | Label prediction accuracy (%) |
|---|---|---|---|---|---|---|
| Baseline | 2 x AMD EPYC 7343 | 8 x 3090 (VRAM 24 GB) | 26 | 10 | 36 | 59.8 |
| bs32_1gpu | Intel Xeon Gold 5218 | 1 x V100 (VRAM 32 GB) | 13 | 15 | 28 | 63.3 |
| bs32_2gpu | Intel Xeon Gold 5218 | 2 x V100 (VRAM 32 GB) | 9 | 20 | 29 | 60.7 |
| bs128_1gpu | AMD EPYC 7413 | 1 x A100 (VRAM 80 GB) | 36 | 20 | 56 | 57.4 |
| bs128_2gpu | AMD EPYC 7413 | 2 x A100 (VRAM 80 GB) | 15 | 13 | 28 | 58.2 |

SEPARATION

Configuration can be found at "config/variants/separation".
The baseline checkpoint of the audio tagging model is used to calculate all CA-SDRi values. The released baseline checkpoints, resunet and resunetk, were each trained on 2 x AMD EPYC 7343 CPUs and 8 x 3090 GPUs (VRAM 24 GB) and reach a final CA-SDRi of 11.03 dB and 11.09 dB, respectively. The reduced-GPU variants progress as follows (CA-SDRi in dB after the indicated training time):

| Config | CPU | GPU | 36 h | 48 h | 72 h | 96 h | 120 h | 144 h |
|---|---|---|---|---|---|---|---|---|
| resunet_bs6 | Intel Xeon Gold 5218 | 2 x V100 (VRAM 32 GB) | 10.23 | 10.65 | 10.69 | 11.31 | 11.05 | 11.26 |
| resunet_bs16 | AMD EPYC 7413 | 1 x A100 (VRAM 80 GB) | 10.46 | 10.63 | 11.13 | 11.15 | 11.39 | 11.42 |
| resunetk_bs6 | Intel Xeon Gold 5218 | 2 x V100 (VRAM 32 GB) | 10.21 | 10.30 | 10.74 | 10.95 | 11.18 | 11.10 |
| resunetk_bs16 | AMD EPYC 7413 | 1 x A100 (VRAM 80 GB) | 10.45 | 10.56 | 10.89 | 11.14 | 11.40 | 11.12 |

Evaluating Baseline Checkpoints

There are three checkpoints for the two baseline systems, corresponding to the AT model and the two variants of the SS model described above. These can be downloaded from the release (e.g., v1.0.0) and placed in the checkpoint folder as

cd dcase2025_task4_baseline
wget -P checkpoint https://github.com/nttcslab/dcase2025_task4_baseline/releases/download/v1.0.0/baseline_checkpoint.zip
unzip checkpoint/baseline_checkpoint.zip -d checkpoint

Class-aware Signal-to-Distortion Ratio (CA-SDRi) and label prediction accuracy can be calculated on the data/dev_set/test data using the baseline checkpoints as

# ResUNetK
python -m src.evaluation.evaluate -c src/evaluation/eval_configs/m2d_resunetk.yaml
"""
CA-SDRi: 11.088
Label prediction accuracy: 59.80
"""

# ResUNet
python -m src.evaluation.evaluate -c src/evaluation/eval_configs/m2d_resunet.yaml
"""
CA-SDRi: 11.032
Label prediction accuracy: 59.80
"""

# Evaluate and generate estimated waveforms
python -m src.evaluation.evaluate -c src/evaluation/eval_configs/m2d_resunetk.yaml --generate_waveform
python -m src.evaluation.evaluate -c src/evaluation/eval_configs/m2d_resunet.yaml --generate_waveform
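
As a quick sanity check on the generated waveforms, a plain (not class-aware) SDR improvement against an oracle target can be computed as below. The file paths are hypothetical examples, the arrays are assumed to have matching shapes, and this is not the official CA-SDRi metric.

import numpy as np
import soundfile as sf

def sdr(ref, est, eps=1e-8):
    # Signal-to-distortion ratio in dB.
    return 10 * np.log10((np.sum(ref ** 2) + eps) / (np.sum((ref - est) ** 2) + eps))

# Hypothetical example paths; ref, mix, and est must have matching shapes.
ref, _ = sf.read("data/dev_set/test/oracle_target/example_target.wav")
mix, _ = sf.read("data/dev_set/test/soundscape/example_mixture.wav")
est, _ = sf.read("estimated_waveform/example_estimate.wav")
print(f"SDRi: {sdr(ref, est) - sdr(ref, mix):.2f} dB")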

To evaluate other model checkpoints, specify their paths under tagger_ckpt and separator_ckpt in the corresponding config files located in src/evaluation/eval_configs.
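
For example, a short sketch that rewrites the two checkpoint paths in an evaluation config; the field names tagger_ckpt and separator_ckpt come from the text above, but the assumption that they sit at the top level of the YAML file, and the checkpoint paths themselves, are placeholders.

import yaml

# Point an eval config at different checkpoints (assumes top-level keys; paths are placeholders).
cfg_path = "src/evaluation/eval_configs/m2d_resunetk.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)
cfg["tagger_ckpt"] = "workspace/label/m2dat_head_blks/checkpoints/epoch=BEST.ckpt"
cfg["separator_ckpt"] = "workspace/separation/resunetk/checkpoints/epoch=BEST.ckpt"
with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)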

Citation

If you use this system, please cite the following papers:

  • Binh Thien Nguyen, Masahiro Yasuda, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, Noboru Harada, "Baseline Systems and Evaluation Metrics for Spatial Semantic Segmentation of Sound Scenes," arXiv preprint arXiv:2503.22088, 2025, available at https://arxiv.org/abs/2503.22088.

  • Masahiro Yasuda, Binh Thien Nguyen, Noboru Harada, Romain Serizel, Mayank Mishra, Marc Delcroix, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Yasunori Ohishi, Tomohiro Nakatani, Takao Kawamura, Nobutaka Ono, "Description and discussion on DCASE 2025 challenge task 4: Spatial Semantic Segmentation of Sound Scenes," arXiv preprint arXiv:xxxx.xxxx, 2025.