Multispeaker Models
To create a multispeaker model using tools such as the Google Colab notebook, place multiple folders within the zip file you upload when preparing your files. Each folder represents one speaker. Speakers can be removed before the final model is built, and the resulting model can be used in software such as OpenUtau.
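As a minimal sketch of this packaging step, the snippet below zips two hypothetical speaker folders ("alice" and "bob") so that each remains a top-level directory inside the archive. The folder names and output filename are placeholders, not a required convention.

```python
# Hypothetical example: packaging two speaker folders into one zip for upload.
# Each top-level folder inside the zip becomes one speaker in the model.
import zipfile
from pathlib import Path

speaker_dirs = [Path("alice"), Path("bob")]  # one folder per speaker

with zipfile.ZipFile("multispeaker_dataset.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for speaker in speaker_dirs:
        for file in speaker.rglob("*"):
            if file.is_file():
                # keep the speaker folder as the top-level directory in the archive
                zf.write(file, arcname=file.relative_to(speaker.parent))
```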
Multispeaker models are not similar to UTAU or VOCALOID appends, in which each voice is a self-contained entity. Although those appends can be set up to be used from within the original voice bank, their existence has no bearing on the original voice bank.
In a multispeaker model, all of the speakers' data is trained together and contributes to the shared model.
Upon loading your model, OpenUtau treats the bank more or less as a single-speaker model. It is through Voice Color and the Voice Color Curve parameters that you can use each speaker separately. While drawing curves can be tedious, you can set one of the Voice Color Curve parameters (each of which carries the name of the speaker it controls) to a default value such as 50 if you wish to morph the speakers into one single voice for an entire song.
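The sketch below is not OpenUtau's actual implementation; it is only an assumption-labeled illustration of the idea behind the curves: each speaker's curve value acts as a weight, and the speakers' embeddings are blended into one voice, so a constant value such as 50 gives a speaker a constant share of the mix.

```python
# Minimal sketch (assumed behavior, not OpenUtau code): blend speaker
# embeddings using their Voice Color Curve values (0-100) as weights.
import numpy as np

def blend_speakers(embeddings: dict, curve_values: dict) -> np.ndarray:
    """Weighted average of speaker embeddings by their curve values."""
    total = sum(curve_values.values())
    if total == 0:
        raise ValueError("at least one speaker needs a non-zero curve value")
    return sum(emb * (curve_values[name] / total)
               for name, emb in embeddings.items())

# Example: morph two hypothetical speakers evenly for a whole song.
rng = np.random.default_rng(0)
embeddings = {"speaker_a": rng.normal(size=256), "speaker_b": rng.normal(size=256)}
mixed = blend_speakers(embeddings, {"speaker_a": 50, "speaker_b": 50})
```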
Aside from offering different "voice colors" to choose from, multispeaker models can be used to improve your own datasets. For example, they make it possible to convert even a CV UTAU library into a functional DiffSinger model, because the necessary variance information can be taken from another dataset.
How you handle language support is up to you. The one rule is that you cannot have conflicting phonemes across the datasets in one model. If you train a Japanese dataset and an English dataset in the same multispeaker model and both label their "r" sound with the same symbol, that phoneme will become muddled and strange.
If you label your Japanese vowels with Japanese phonemes and your English vowels with English phonemes, then there shouldn't be noticeable bleed-through that makes either language sound more or less like the other on a phoneme level. However, the variance model may still suffer if your Japanese and English singing are vastly different.
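One possible way to keep the phoneme sets separate is to tag every phoneme in a dataset's labels with a language prefix before combining datasets. This is only a hedged sketch: the label format, file layout, and tag names here are assumptions, not a fixed DiffSinger convention.

```python
# Assumed label format "name|phoneme phoneme ..."; the "ja"/"en" tags and the
# "/" separator are illustrative choices, not a required scheme.
def tag_phonemes(label_line: str, lang: str) -> str:
    """Prefix each space-separated phoneme with a language tag, e.g. 'r' -> 'ja/r'."""
    name, phonemes = label_line.split("|", 1)
    tagged = " ".join(f"{lang}/{p}" for p in phonemes.split())
    return f"{name}|{tagged}"

print(tag_phonemes("sample_001|k a r a", "ja"))  # sample_001|ja/k ja/a ja/r ja/a
print(tag_phonemes("sample_002|r eh d", "en"))   # sample_002|en/r en/eh en/d
```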
All phonemes from every dataset are shared across the whole model, so every speaker inherits every other speaker's phonemes.
It should be said that you should never use data you do not have permission to use for training. However, there are conflicting opinions on what usage is appropriate. Some believe that if a speaker is dropped from the model, it is irrelevant whether that speaker's data was used during training.
Multispeaker models can be used to expand a model's emotional range, provide more data for a more polished voice, or add extra language capabilities. The exact legality of using others' data without explicit permission, even when the speaker is dropped before the model is created, is unknown. It is best to be careful and always ask for permission if it is not explicitly granted in the dataset.