Multivocal AI Voice 2: The Pooled Voice

This is the second of three voice explorations in our series on multivocal AI voice cloning.

This voice is created from a single dataset containing audio files from two different speakers. The two speakers are actually the same person, but one set of recordings has been pitched down. The voice data comes from Kimberly Krause’s reading of Eight Girls and a Dog, found on the public domain audiobook site LibriVox. We call this The Pooled Voice because both voices have been mixed into the same pool.
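To make the pooling idea concrete, here is a minimal sketch of how two sets of recordings could be merged into one training filelist. It assumes the `path|transcript` line format used by common Tacotron2 training setups. The function `pool_filelists` and the file names are hypothetical, made up for illustration, and this is not necessarily how our dataset was prepared.

```python
import random

def pool_filelists(voice_a, voice_b, seed=0):
    """Merge two lists of (wav_path, transcript) pairs into one filelist."""
    entries = list(voice_a) + list(voice_b)
    # Shuffle with a fixed seed so the pooled order is reproducible.
    random.Random(seed).shuffle(entries)
    # No speaker column is written, so nothing in the filelist
    # tells the trainer that two different voices are present.
    return [f"{path}|{text}" for path, text in entries]

# Hypothetical file names; the transcripts are the two example lines quoted in this post.
voice_a = [("wavs/voice_a_0001.wav",
            "Margy, she said despairingly, I hope you packed the medicine chest I gave you")]
voice_b = [("wavs/voice_b_0001.wav",
            "Chapter one of Eight Girls and a Dog")]

lines = pool_filelists(voice_a, voice_b)
print("\n".join(lines))
```

The point of the sketch is the missing speaker label: once the entries are pooled, the two voices are indistinguishable to the training code except through the audio itself.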

The voice is trained using Tacotron2 in justinjohn036’s Google Colab notebook.

You can hear an example of the first voice in the dataset here saying “Margy, she said despairingly, I hope you packed the medicine chest I gave you”:

An example of the second voice in the dataset can be heard here saying “Chapter one of Eight Girls and a Dog”:

With this approach, we deliberately depart from the usual recommendation for voice cloning, which is to use one speaker per dataset. By mixing multiple voices into the dataset instead, we encourage the software to treat both voices as part of the same synthetic voice.

However, the resulting synthetic voice does not manifest as a combination of both speakers. Instead, when synthesizing new utterances, the model tends to choose one of the two voices to speak with. It seems as if the model first decides which voice is most likely to be speaking the initial segment of the generated audio, and this decision sets the precedent for the rest of the audio. This makes sense considering that a model like this is more or less an “autocomplete for sound”. It starts by creating the first piece of audio, and then adds more piece by piece, each new piece conditioned on what has already been generated, until the provided sentence is complete.
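The following toy sketch illustrates this intuition. It is not Tacotron2 and makes no claim about its internals: the function `generate`, the voice labels "A" and "B", the scores, and the `stickiness` parameter are all assumptions invented for the example. Each frame is chosen greedily given the previous frame, and staying in the current voice is assumed to be more likely than switching.

```python
def generate(first_frame_scores, n_frames, stickiness=0.9):
    """Toy greedy autoregressive decoder over two voice labels, "A" and "B"."""
    # The first frame is picked from scores that stand in for the
    # model's text-dependent preference between the two voices.
    frames = [max(first_frame_scores, key=first_frame_scores.get)]
    for _ in range(n_frames - 1):
        prev = frames[-1]
        other = "B" if prev == "A" else "A"
        # Every later frame is conditioned only on the previous frame:
        # continuing in the same voice is assumed to be more probable.
        probs = {prev: stickiness, other: 1 - stickiness}
        frames.append(max(probs, key=probs.get))
    return "".join(frames)

# Two hypothetical sentences with slightly different first-frame scores.
print(generate({"A": 0.45, "B": 0.55}, 8))  # BBBBBBBB
print(generate({"A": 0.60, "B": 0.40}, 8))  # AAAAAAAA
```

In this toy, a small difference in the first-frame scores decides the voice for the whole utterance, and the same input always yields the same output because nothing in the decoding is random.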

In this synthetic utterance by The Pooled Voice, we can hear that the pitched down voice is used for the sentence “Choose the voice you’d like to use”:

When instructed to say the words “Hello world”, The Pooled Voice produces the original non-pitched voice from the dataset:

Running the model on the same sentence multiple times always gave the same result: a given sentence was consistently spoken in the same one of the two voices.

Artistically, The Pooled Voice has limited potential. In the end, it seems to act as a sort of random picker of voices. The artist has no direct control over which voice gets produced, so a certain level of control is ceded to the machine learning model. Yet because the model’s choice stems from statistical inference rather than chance, this behavior cannot simply be replicated by an auxiliary random function.
