SongGLM is a lyrics-to-melody generation system that leverages 2D alignment encoding and a multi-task pre-training framework to ensure alignment and harmony between lyrics and melody.
*Figure: The overall architecture of our SongGLM framework.*
SongGLM requires the following packages.

From the Python standard library (no installation needed):
- os, re, glob, typing, json, math, collections, multiprocessing, pickle, operator, itertools, functools, statistics

Third-party:
- lightning
- numpy
- torch
- miditoolkit
- pandas
- scipy
- music21
- dtw
- sklearn
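If the third-party dependencies are missing, an install along these lines should work (note: PyPI package names can differ from import names; `sklearn` is distributed as `scikit-learn`, and the `dtw` import is typically provided by `dtw-python`, so treat the exact names below as assumptions to verify against your environment):

```bash
pip install lightning numpy torch miditoolkit pandas scipy music21 dtw-python scikit-learn
```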
You can now run the following scripts for harmonized N-gram extraction, span sampling and data preparation, multi-task pre-training, fine-tuning, generation, and evaluation.
Harmonized N-gram extraction captures the correspondence between lyric and melody features; the features currently considered are syllable stress in the lyrics and melodic peaks and rhythm skeletons in the melody.
Harmonized N-gram extraction is required only for the pre-training data (1000 pieces):
```bash
sh script/extract_ngrams.sh
```
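As a rough illustration of the idea, not the repository's actual implementation, the sketch below counts N-grams over paired lyric/melody feature sequences and scores how strongly the two patterns co-occur. The binary `stress`/`peak` encoding and the PMI-style score are assumptions made for this example.

```python
from collections import Counter
from math import log

def harmonized_ngrams(songs, n=3, min_count=5):
    """Toy harmonized N-gram scorer (illustrative, not SongGLM's code).

    Each song is a list of (stress, peak) pairs: `stress` marks a stressed
    syllable in the lyrics, `peak` marks a melodic peak / rhythm-skeleton
    note on the aligned melody note.
    """
    joint, lyric, melody = Counter(), Counter(), Counter()
    total = 0
    for song in songs:
        for i in range(len(song) - n + 1):
            window = song[i:i + n]
            l_gram = tuple(s for s, _ in window)   # lyric-feature N-gram
            m_gram = tuple(p for _, p in window)   # melody-feature N-gram
            joint[(l_gram, m_gram)] += 1
            lyric[l_gram] += 1
            melody[m_gram] += 1
            total += 1
    # Score each paired N-gram by pointwise mutual information: high PMI
    # means the lyric and melody patterns co-occur more often than chance.
    scores = {}
    for (lg, mg), c in joint.items():
        if c >= min_count:
            scores[(lg, mg)] = log(c * total / (lyric[lg] * melody[mg]))
    return scores

# Example: two tiny "songs" with binary stress/peak features.
songs = [[(1, 1), (0, 0), (1, 1), (0, 0), (1, 1), (0, 0)],
         [(1, 1), (0, 0), (0, 1), (1, 0), (1, 1), (0, 0)]]
print(sorted(harmonized_ngrams(songs, n=2, min_count=1).items(),
             key=lambda kv: -kv[1])[:3])
```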
Construct a multi-task pre-training framework incorporating three scales of autoregressive blank-filling objectives (a sketch of the span sampling follows the list):
- Word-Level: harmonized N-grams are randomly sampled from the note sequence via maximum matching against the extracted N-gram lexicon, until the sampled N-grams cover 15% of the sequence.
- Phrase-Level: multiple musical phrases are sampled so that their total length covers 50% of the original sequence.
- Song-Level: a single continuous span covering 50% of the original sequence is sampled.
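The following is a minimal sketch of the three sampling scales, assuming a note sequence given as a list of tokens, a set-like N-gram lexicon, and precomputed phrase boundaries; the function names and the greedy maximum-matching variant are illustrative assumptions, not the repository's code.

```python
import random

def max_match_spans(seq, lexicon, max_n=8):
    """Greedy maximum matching: at each position, take the longest
    N-gram (as a tuple of tokens) found in the lexicon."""
    spans, i = [], 0
    while i < len(seq):
        for n in range(min(max_n, len(seq) - i), 0, -1):
            if tuple(seq[i:i + n]) in lexicon:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1
    return spans

def sample_word_spans(seq, lexicon, ratio=0.15):
    """Word-level: randomly keep matched N-grams until ~15% is covered."""
    candidates = max_match_spans(seq, lexicon)
    random.shuffle(candidates)
    chosen, budget = [], int(ratio * len(seq))
    for s, e in candidates:
        if e - s <= budget:
            chosen.append((s, e))
            budget -= e - s
    return sorted(chosen)

def sample_phrase_spans(phrases, seq_len, ratio=0.5):
    """Phrase-level: sample whole phrases until ~50% is covered.
    `phrases` is a list of (start, end) phrase boundaries."""
    order = random.sample(phrases, len(phrases))
    chosen, budget = [], int(ratio * seq_len)
    for s, e in order:
        if e - s <= budget:
            chosen.append((s, e))
            budget -= e - s
    return sorted(chosen)

def sample_song_span(seq_len, ratio=0.5):
    """Song-level: one continuous span covering 50% of the sequence."""
    length = int(ratio * seq_len)
    start = random.randint(0, seq_len - length)
    return [(start, start + length)]
```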
For pre-training data:
```bash
sh script/prepare_data.sh
```
For fine-tuning data:
```bash
sh script/prepare_data_wiki.sh
```
Then start multi-task pre-training:
```bash
sh script/pretrain_multitasks.sh
```
Fine-tune the model in an autoregressive manner on a high-quality lyrics-melody dataset, enabling it to generate melodies from lyrics.
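For intuition, autoregressive fine-tuning here amounts to next-token prediction over melody tokens conditioned on the lyrics. The sketch below is a minimal illustration, assuming a decoder-style `model` that maps token ids to logits; it is not the repository's training loop.

```python
import torch
import torch.nn.functional as F

def lyrics_to_melody_loss(model, lyric_ids, melody_ids):
    """Next-token loss on melody tokens only (illustrative sketch).

    lyric_ids:  (batch, L) lyric token ids (conditioning, not predicted)
    melody_ids: (batch, M) melody token ids (targets)
    """
    tokens = torch.cat([lyric_ids, melody_ids], dim=1)   # (batch, L+M)
    logits = model(tokens)                               # (batch, L+M, vocab)
    L = lyric_ids.size(1)
    # Position L-1 predicts the first melody token, position L+M-2
    # predicts the last one, hence the shift by one.
    pred = logits[:, L - 1:-1]                           # (batch, M, vocab)
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           melody_ids.reshape(-1))
```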
To fine-tune on the fine-tuning dataset (1000 pieces):
```bash
sh script/finetune_multitasks_wiki.sh
```
Generate melodies from lyrics with the fine-tuned model:
```bash
sh script/generate_multitasks_wiki.sh
```
Finally, evaluate the generated results:
```bash
sh script/evaluation_multitasks.sh
```