This repository provides an example script (tea_ft.py) for fine-tuning BioBERT base cased v1.2 models on various TEA datasets using Hugging Face machine learning libraries. Example results from running the script on the Pathogen Identifier and Strain Tagger corpora are shown below (median results for the augmentation experiment over six random seeds and two epochs). Performance is measured on two different test datasets: the regular test dataset contains the baseline examples, while the enriched test dataset contains similar examples with approximately seven times more unique species and strain names, and can therefore be used to gauge potential overfitting to specific species and strain names.
*(Results tables for the Pathogen Identifier and Strain Tagger corpora: baseline test dataset vs. enriched test dataset.)*
Please note: it is very important to fix the broken tokenizer in the BioBERT base cased v1.2 distribution, which functions in uncased mode by default. The script takes care of this by forcing the tokenizer into cased mode, but you can also use a working tokenizer configuration file from the BioBERT base cased v1.1 repository if needed.
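The cased-mode fix can be sketched as follows. This is a minimal illustration assuming the standard Hugging Face tokenizer_config.json format, where the `do_lower_case` field controls lowercasing; the `force_cased` helper is hypothetical and not part of tea_ft.py:

```python
# Hypothetical sketch: forcing a tokenizer config into cased mode by
# disabling lowercasing in its tokenizer_config.json content.
import json

def force_cased(config_json: str) -> str:
    """Return tokenizer config JSON with lowercasing disabled (cased mode)."""
    cfg = json.loads(config_json)
    cfg["do_lower_case"] = False  # keep original capitalisation
    return json.dumps(cfg)
```

The same effect can usually be achieved at load time by passing `do_lower_case=False` to the tokenizer loader.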
The first step in fine-tuning the models is to download the TEA datasets from the GitHub repository. This can be done by running the following command in the project root directory:
wget https://github.com/tznurmin/TEA_datasets/archive/refs/tags/v1.0.tar.gz -qO - | tar -xz && mv TEA_datasets-1.0 TEA_datasets
Next, install the Python dependencies and you are good to go. The instructions were tested on Python 3.8.19 but should work with any modern Python version.
Set up and activate a virtual environment in the project root:
python -m venv .venv
source .venv/bin/activate
To quickly set up the environment on Linux systems with existing CUDA (11.8) support, run the following commands:
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install transformers[torch] datasets evaluate seqeval
Newer CUDA versions will most probably also work. See PyTorch website for instructions.
For a comprehensive setup, a requirements.txt file is included for installing the necessary Python packages. Additionally, a flake.nix file is provided for optional system-level configuration (e.g. for system-level CUDA support).
Install the required Python packages by running:
pip install -r requirements.txt
Optionally, use the provided flake in a Nix system by running the following in the root directory:
nix develop
Use tea_ft.py to run the fine-tuning experiments once the dependencies are available and the datasets have been downloaded. The script selects the correct datasets based on the given arguments and automatically creates a log directory for the experimental results.
The fine-tuning script requires two mandatory arguments: --type (pathogens or strains) to select the experiment type and --experiment (augmentation, strategy, mix1, mix2 or mix3) to select the specific experiment.
There are a few optional arguments, such as the random seed (--seed), the number of epochs (--epochs) and the batch size (--batch_size). Run the script with --help for details.
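The command-line interface described above can be sketched with argparse as follows. This is a reconstruction from the documented flags only; the actual defaults and help text in tea_ft.py may differ:

```python
# Hedged sketch of the tea_ft.py CLI, based on the documented flags.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="TEA fine-tuning (sketch)")
    p.add_argument("--type", required=True, choices=["pathogens", "strains"])
    p.add_argument("--experiment", required=True,
                   choices=["augmentation", "strategy", "mix1", "mix2", "mix3"])
    p.add_argument("--seed", type=int, default=42)        # default assumed
    p.add_argument("--epochs", type=int, default=2)       # default assumed
    p.add_argument("--batch_size", type=int, default=16)  # default assumed
    return p
```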
For example, to run the pathogens augmentation experiment with seed 42:
python tea_ft.py --type pathogens --experiment augmentation --seed 42
This creates log files from the experiment in the logs directory.
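A per-run log directory could be organised as sketched below. The naming scheme is purely hypothetical (tea_ft.py's actual layout may differ) and is shown only to illustrate keeping runs with different seeds separate:

```python
# Hypothetical sketch of a per-run log directory layout; the naming
# scheme is assumed, not taken from tea_ft.py.
from pathlib import Path

def log_dir(exp_type: str, experiment: str, seed: int) -> Path:
    """Create (if needed) and return a run-specific log directory."""
    d = Path("logs") / f"{exp_type}_{experiment}_seed{seed}"
    d.mkdir(parents=True, exist_ok=True)
    return d
```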