Multivocal AI Voice 1: The Choral Voice

This is the first of three voices in our exploration of multivocal AI voice cloning.

This voice is created by layering three versions of the same voice on top of each other. The original voice comes from Kimberly Krause’s reading of Eight Girls and a Dog, found on the public domain audiobook site LibriVox. In our dataset, one version is pitched down, another is pitched up, and the third is left at the original pitch. The result loosely emulates a choir, which is why we named this approach The Choral Voice.
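The layering idea can be sketched in a few lines of Python. This is a minimal illustration, not the tooling we used: the function names and the four-semitone interval are assumptions made for the example, and the pitch shift here is a naive resampling that also changes the duration of the audio. It works for a sustained test tone, but for speech you would want a duration-preserving pitch shifter (for instance `librosa.effects.pitch_shift`) so that the three layers stay aligned.

```python
import math


def shift_pitch_by_resampling(samples, semitones):
    """Naive pitch shift: read the signal faster or slower using linear interpolation.

    Caveat: this changes the duration as well as the pitch.
    """
    if not samples:
        return []
    rate = 2.0 ** (semitones / 12.0)  # playback-rate factor for the interval
    out_len = int((len(samples) - 1) / rate) + 1
    out = []
    for i in range(out_len):
        pos = i * rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def layer_voices(tracks):
    """Average the tracks sample by sample, trimmed to the shortest track."""
    length = min(len(t) for t in tracks)
    return [sum(t[i] for t in tracks) / len(tracks) for i in range(length)]


def make_choral(samples, semitones=4.0):
    """Layer a pitched-down, an unchanged, and a pitched-up copy of one signal.

    The default interval of 4 semitones is a hypothetical choice.
    """
    low = shift_pitch_by_resampling(samples, -semitones)
    high = shift_pitch_by_resampling(samples, +semitones)
    return layer_voices([low, samples, high])


# Stand-in for a voice recording: half a second of a 220 Hz tone at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(4000)]
choral = make_choral(tone)
```

Averaging the three layers, rather than summing them, keeps the mix within the amplitude range of the inputs, so no extra normalization step is needed before writing the result to a file.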

The voice is trained using Tacotron2 in justinjohn036’s Google Colab notebook.
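For readers who want to prepare a similar dataset: NVIDIA's reference Tacotron2 implementation reads its training data from a text filelist with one utterance per line, the audio path and the transcript separated by a pipe character. The sketch below builds such lines. It assumes the notebook follows the same convention, and the helper name and the file path in the example are hypothetical.

```python
def build_filelist(entries):
    """Return 'wav_path|transcript' lines, one per utterance.

    `entries` is an iterable of (wav_path, transcript) pairs.
    """
    lines = []
    for wav_path, text in entries:
        text = " ".join(text.split())  # collapse line breaks and repeated spaces
        if "|" in wav_path or "|" in text:
            raise ValueError("the pipe character is reserved as the field separator")
        lines.append(f"{wav_path}|{text}")
    return lines


# Hypothetical path paired with the example sentence from the choral dataset.
filelist = build_filelist([
    ("wavs/choral_0001.wav",
     "Have your pitchers bigger than your pillows and the thing is done"),
])
```

Each layered clip keeps the transcript of the original reading, since the three pitch layers all speak the same words.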

You can hear an example from the choral dataset reading the text “Have your pitchers bigger than your pillows and the thing is done” below:

After a couple of hours of training, the voice clone produces a synthetic voice that sounds quite a lot like the original data. One of the main downsides of this approach, however, is that the intelligibility of both the layered training data and the resulting synthetic voice is very low. It is hard to tell what is being said with this type of multivocal voice. Still, we were pleasantly surprised that the voice cloning software was able to faithfully reproduce audio with multiple voices inside it.

You can hear an example of The Choral Voice saying “Choose the voice you’d like to use” below:

From an aesthetic point of view, The Choral Voice approach shows that the machine learning model is quite capable of picking up the connection between text and audio, even when the training data has low intelligibility. This opens up the opportunity for manipulating the original dataset even more to create novel ways of synthesizing voices. As long as it is possible to establish a statistical connection between utterance and text, Tacotron2 seems capable of reproducing the original in the synthesized version.
