Singing Database Creation
Derived from SVS Singing Voice Database - Tutorial, by PixPrucer
Last revision: 16.05.2024
EDITOR NOTE: Italicised text in brackets (like this) marks a placeholder topic that still needs to be expanded upon.
Before starting production of a singing database, it's recommended to consider what you want the model to perform: what kind of voice or vocal techniques should it replicate, and which languages should it produce? DiffSinger, like any machine learning algorithm, optimises the model to reproduce its input closely. It can't outperform the data it was trained on, so it's important to plan the necessary coverage cases into the database beforehand. Write down your data plans so you can refer to them later during the production of the singing database.
(folder structuring)
Pick out songs you already know. Filter through them and remove any that are too difficult for you to sing, or that consist primarily of ad-libs (little actual vocals or lyrics). Be sure to skip repeating sections: DiffSinger needs diverse, varied data to learn well, and singing an identical section twice or three times won't benefit the model much (unless it is sung in a different key or tempo).
CAUTION: Regarding copyright, singing copyrighted songs for the database falls under fair use, so it is safe to include copyrighted material in your database to some degree. This only applies, however, as long as the performance is used for nonprofit or research purposes. If you wish to release a database containing copyrighted material publicly (or share it commercially), you will need to write an appropriate licence.
Obtaining singing samples is a crucial step in making a database. When recording, make sure the samples contain only dry vocals. Any foreign sounds can confuse the model and lead it to produce bad results.
It has also been found that excessive reverb can negatively impact the model. Ensure your recording environment is acoustically treated and has minimal sound reflections.
The WAV files should be saved in 44.1 kHz, 16-bit, mono format in their respective folders.
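If your takes were recorded at a different sample rate or in stereo, they can be batch-converted. Below is a minimal sketch using Python with the librosa and soundfile packages; the raw/ and wavs/ folder names are only placeholders for this example, not a layout the tutorial prescribes.

```python
# Minimal sketch: batch-convert recordings to 44.1 kHz, 16-bit, mono WAV.
# Assumes librosa and soundfile are installed; folder names are illustrative.
from pathlib import Path

import librosa
import soundfile as sf

SRC = Path("raw")    # wherever your unprocessed takes live (example name)
DST = Path("wavs")   # destination for database-ready files (example name)
DST.mkdir(exist_ok=True)

for wav_path in sorted(SRC.glob("*.wav")):
    # librosa resamples to 44.1 kHz and downmixes to mono on load
    audio, sr = librosa.load(wav_path, sr=44100, mono=True)
    # soundfile writes the float data back out as 16-bit PCM
    sf.write(DST / wav_path.name, audio, sr, subtype="PCM_16")
    print(f"converted {wav_path.name}")
```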
Once you’ve collected the singing samples, you must label them with time-aligned phoneme annotations. This can be done with dedicated labelling software (see below).
This is the tricky part, as you need to accurately transcribe which phoneme is pronounced at which time. Labelling is the "programming" of your voice database. In other words, the model's behaviour depends on how you configure it.
The software the community primarily uses for this task is vLabeler. You can read how to use it in detail on its official site.
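For orientation, labels in NNSVS-style singing databases (which vLabeler can export) are commonly plain-text HTK label files: one phoneme per line, with start and end times counted in 100-nanosecond units. The snippet below is a rough sketch of reading such a file in Python; the file name is only an example, and if your tooling uses a different time unit or format, adjust accordingly.

```python
# Minimal sketch of reading an HTK-style .lab file (times in 100 ns units),
# as commonly used for NNSVS-style singing databases.
from pathlib import Path


def read_lab(path):
    """Return a list of (start_seconds, end_seconds, phoneme) tuples."""
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        start, end, phoneme = line.split()
        # HTK label times are integers counted in 100-nanosecond units.
        entries.append((int(start) / 1e7, int(end) / 1e7, phoneme))
    return entries


if __name__ == "__main__":
    # "song_01.lab" is a hypothetical file name used for illustration.
    for start, end, ph in read_lab("song_01.lab"):
        print(f"{start:.3f}-{end:.3f}s  {ph}")
```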
(expand on the labelling here, needs visuals)
(what phonetics to use, in reference to OPU DIFFS phonemisers)
(using uta's database converter)
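Until the sections above are filled in, one practical precaution before converting or training is a quick consistency check. The sketch below assumes an NNSVS-style layout in which every wav has a same-named lab file beside it; the database folder name is illustrative.

```python
# Minimal sketch: flag wav files missing a matching .lab label file, and
# label files missing their audio. Assumes wav/lab pairs share the same stem.
from pathlib import Path

DB_DIR = Path("my_database")  # illustrative folder name

wav_keys = {p.relative_to(DB_DIR).with_suffix("") for p in DB_DIR.rglob("*.wav")}
lab_keys = {p.relative_to(DB_DIR).with_suffix("") for p in DB_DIR.rglob("*.lab")}

for key in sorted(wav_keys - lab_keys):
    print(f"missing label for: {key}.wav")
for key in sorted(lab_keys - wav_keys):
    print(f"label without audio: {key}.lab")
```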
After this step, the database is ready for training.
(editing samples, how various effects affect the output model)
(sofa/labelmakr - speeding up labelling using forced alignment)
(variance vs. speakers - encoding emotion/dynamic in the database)
(slurcutter - pitch prediction training preparation)