This page evaluates progress in using and training machine learning models for speech synthesis in Python, in particular for voice cloning. It documents the output audio quality of the TTS solutions I’ve been working on, and how they’ve evolved as I’ve built upon them. This page is purely for educational purposes and for demonstrating the quality of the results.
Figure 1 illustrates the idea for the overall “pipeline” to create a state-of-the-art voice cloning solution. The concept was to take a strong text-to-speech (TTS) model, which would generate authentic flow, and pair it with a speech-to-speech (S2S) model that is much better at copying the speaker's acoustic traits, fine details, and accentuations.
When deciding on a voice, I ideally wanted a fictional voice that would make for an interesting example and had a good amount of data to train with. I went for Ranni the Witch from Elden Ring, as she had the largest amount of dialogue in the game (around 250 voice lines, amounting to roughly 20 minutes of speaking audio).
I will say that this was probably quite a difficult first voice to throw at it, and as you will see in the examples below, the accent, pronunciations, and timbre are highly distinctive, and much of the vocabulary used in the dialogue is old-fashioned. Even so, by the end of it we saw some incredible results, and I was very pleased with the overall quality and accuracy that was possible. The fact that it coped as well as it did with such an edge case also makes me very optimistic about training with more typical or “conventional” datasets.
We actually started off by training RVC first. Because training XTTS is more complex and requires more attention, we initially used XTTS's zero-shot voice cloning feature rather than a fully fine-tuned XTTS model.
First, you can see the reference audio pulled from the game files, which XTTS needs in order to generate audio (even when trained). This was the reference audio used for most of the generations in the results.
The second row is the XTTS generation from a zero-shot cloning attempt using a few reference audios. Row 3 is the trained RVC model applied on top of this, and row 4 is a second RVC training attempt that was trained for longer, and is what will be used henceforth. (A rough code sketch of the zero-shot setup is included after the table.)
Audio | Comments |
---|---|
Reference audio 153 "Each of us was chosen by our own Two Fingers, as a candidate to succeed Queen Marika, to become the new god of the coming age." | An original piece of speaker audio from the training dataset, used as a reference for the XTTS generations. |
XTTS_test_ranni_multiref_zeroshot.wav "It took me quite a long time to develop a voice, and now that I have it, I am not going to be silent." | TTS only. Zero-shot cloning with a few reference audios does not provide anywhere near a usable level of cloning accuracy. |
RVC_ranni1_train_on_XTTS_test.wav "It took me quite a long time to develop a voice, and now that I have it, I am not going to be silent." | Passing the TTS output through the RVC speech-to-speech (S2S) voice conversion. There is a drastic increase in quality, and it manages to handle a surprising number of accentuations/quirks of the speaker’s voice, given the terrible base TTS output it had to work with. This demonstrates the importance of using S2S to further refine TTS generation accuracy. |
RVC_ranni2_train_on_XTTS_test.wav "It took me quite a long time to develop a voice, and now that I have it, I am not going to be silent." | An attempt using a second, longer-trained RVC speech-to-speech model, and the one that will be used henceforth. |
Overall, we can see that the trained RVC model brings significant benefits to voice accuracy; it’s clear that it just needs a strong base audio to work from. It’s difficult to pick a “winner” between the Ranni 1 and Ranni 2 RVC models without spending hours comparing dozens of generations; they simply sound slightly “different”, acoustically speaking.
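For reference, a zero-shot generation like the one above can be produced with Coqui TTS's high-level API along the lines of the sketch below. This is a minimal sketch rather than the exact script used here: the file paths are placeholders, and passing several reference clips via speaker_wav may depend on the library version. The RVC conversion on top was run through RVC's own tooling, so it is not shown.

```python
# Minimal zero-shot XTTS cloning sketch using Coqui TTS's high-level API.
# Paths are placeholders; the real reference clips were pulled from the game files.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

tts.tts_to_file(
    text=("It took me quite a long time to develop a voice, "
          "and now that I have it, I am not going to be silent."),
    speaker_wav=[
        "refs/reference_audio_153.wav",  # a few short clips of the target speaker
        "refs/reference_audio_021.wav",
        "refs/reference_audio_087.wav",
    ],
    language="en",
    file_path="XTTS_test_ranni_multiref_zeroshot.wav",
)
```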
The second stage saw the addition of a fully trained (fine-tuned) XTTS model, and was a massive turning point for the accuracy of the voice cloning. Below you can see two example generations (the “temp” and “temp2” files), each with and without RVC on top. The quality of even just XTTS was a huge surprise: although not good enough to be “fully immersive” on its own, I was really impressed to see it do so well before any speech-to-speech conversion, especially compared to the zero-shot attempt. (A sketch of loading and running the fine-tuned model is included after the table.)
Audio | Comments |
---|---|
Reference audio 153 "Each of us was chosen by our own Two Fingers, as a candidate to succeed Queen Marika, to become the new god of the coming age." | Reference audio, as before. |
xtts_temp.wav "It took me quite a long time to develop a voice, and now that I have it, I am not going to be silent." | As usual, first we have the TTS output (XTTS) only. This time, we have an actual trained XTTS model. It gets somewhere in the ballpark of the original speaker’s voice, and holds up respectably on its own without S2S conversion through RVC. It has small amounts of noise, which RVC will smooth out. |
xtts_temp_w_rvc.wav "It took me quite a long time to develop a voice, and now that I have it, I am not going to be silent." | By combining RVC’s capabilities with a stronger base TTS output this time, we again see a massive jump in voice cloning accuracy, extremely similar to the original. This is a great base quality to build from, and we can now focus on the other areas discussed below. |
xtts_temp2.wav "Ah, a dictionary in Python, much akin to a repository of paired knowledge, doth serve as a mutable collection of key-value pairs. Each key acts as a unique identifier, whilst its corresponding value holds the data associated with it. This structure alloweth for swift retrieval, insertion, or modification of data, provided the key is known. One might envision it as [audio cutoff]" | XTTS-only generation of a response written by ChatGPT, which was instructed to explain something in the speaker’s mannerisms. Here we see a point of struggle: longer generations. XTTS reaches its limit and truncates the audio, but even before that point the audio starts to lose accuracy and sound more like standard text-to-speech (“retrieval, insertion, or modification”). We also see irregular flow and speeding up of the audio. |
xtts_temp2_w_rvc.wav "Ah, a dictionary in Python, much akin to a repository of paired knowledge, doth serve as a mutable collection of key-value pairs. Each key acts as a unique identifier, whilst its corresponding value holds the data associated with it. This structure alloweth for swift retrieval, insertion, or modification of data, provided the key is known. One might envision it as [audio cutoff]" | Putting this through S2S conversion, RVC makes an impressive attempt at preserving accuracy, and the result is still largely usable. However, since it is just S2S, determining the bulk of the emotional expression and flow is outside the scope of its abilities, and it is not able to fix the fundamental issues in this regard. |
gambling_prompt_xtts_w_rvc.wav "Ah, thou does tempt fate with reckless abandon, does thou not? Shall I prepare a eulogy for thy savings now, or later, once the dice hath rolled and fortune laughed in thy face?" | XTTS + RVC. A response to a silly prompt asking for its thoughts on spending all your money gambling. Similar to the previous example, the generation is somewhat long, and the second of the two sentences has slightly odd flow, heard at the end as the pitch climbing somewhat unnaturally (“fortune laughed in thy face?”). |
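For context, running inference with a fine-tuned XTTS checkpoint looks roughly like the sketch below. The paths, checkpoint directory, and temperature value are illustrative rather than the exact training run used here; the two "temp" files above are simply two separate generations of this kind.

```python
# Sketch of inference with a fine-tuned XTTS model (paths are placeholders).
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned checkpoint produced by XTTS training.
config = XttsConfig()
config.load_json("run/training/ranni_xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="run/training/ranni_xtts/", eval=True)
model.cuda()

# Compute the speaker's conditioning latents and embedding from a reference clip.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["refs/reference_audio_153.wav"]
)

out = model.inference(
    text=("It took me quite a long time to develop a voice, "
          "and now that I have it, I am not going to be silent."),
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,  # illustrative value; controls variation between generations
)

# XTTS outputs 24 kHz audio; save it, then pass the file through RVC for S2S conversion.
torchaudio.save("xtts_temp.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```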
The bottleneck for quality as we move to the next stage is the length of the input text and the generated audio. The speaker’s manner of speech (and thus the training data) consists of short sentences and clauses anyway, which hinders inference quality on longer sentences. But more importantly, XTTS itself reaches its limit for a single inference. This results in either strange pitch and flow behaviour, or flat-out truncated audio (with longer text, the audio does not proceed past a certain point, and in rare cases it might even drop a clause from the middle of the generated sentences).
In this stage, we introduced “batched” XTTS inference: performing multiple separate TTS generations and then concatenating the audio segments to get the final output. To achieve this, we had to introduce much more thorough string processing.
In previous stages, the text to be read was handed to XTTS as a single string variable set directly in the script, which had to be changed manually each time. We first switched this to a list, breaking up the sentences and iterating XTTS inference separately over each sentence in the list. Once XTTS could iteratively infer audio pieces and join them together, we implemented proper, automated text handling. Text is now read from a .txt file rather than being hardcoded into the script, and we use Python’s Regular Expressions library (re) to split the input text into sentences at its punctuation.
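A minimal sketch of this batched approach is below, reusing the model and conditioning latents from the previous sketch; the file names are placeholders.

```python
# Sketch of "batched" XTTS inference: split the input text into sentences,
# generate each sentence separately, then concatenate the audio segments.
import re
import numpy as np
import soundfile as sf

# Read the text to narrate from a file instead of hardcoding it in the script.
with open("input_text.txt", "r", encoding="utf-8") as f:
    text = f.read().strip()

# Split into sentences on ellipses, full stops, exclamation and question marks.
sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]

pieces = []
for sentence in sentences:
    # model, gpt_cond_latent and speaker_embedding are loaded as in the previous sketch.
    out = model.inference(
        text=sentence,
        language="en",
        gpt_cond_latent=gpt_cond_latent,
        speaker_embedding=speaker_embedding,
    )
    pieces.append(np.asarray(out["wav"]))

# Concatenate the per-sentence audio into the final output (XTTS outputs 24 kHz audio).
sf.write("ranni_seneca_quote_xtts_batched.wav", np.concatenate(pieces), 24000)
```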
The benefit of this was that it unlocked practically limitless generation length, without the artifacts or strange behaviours observed in the longer audios of the previous stage. Below is the model’s narration of a quote by Seneca that is 287 characters long (beyond XTTS’s rough limit of 250).
Audio | Comments |
---|---|
Reference audio 153 "Each of us was chosen by our own Two Fingers, as a candidate to succeed Queen Marika, to become the new god of the coming age." | Reference audio, as before. |
ranni_seneca_quote_xtts_batched.wav "The greatest obstacle to living is expectancy, which hangs upon tomorrow and loses today. You are arranging what lies in Fortune's control, and abandoning what lies in yours. What are you looking at? To what goal are you straining? The whole future lies in uncertainty; live immediately!" | XTTS only first, generated from a quote by Seneca. Since each sentence now gets its own separate inference/generation, with them all concatenated together at the end, we get consistent flow across the generation. Interestingly, there appears to be some slightly increased noise in these tests; it could be hardware/memory-management settings automatically causing slightly reduced performance, and will require deeper inspection. |
ranni_seneca_quote_xtts_w_rvc.mp3 "The greatest obstacle to living is expectancy, which hangs upon tomorrow and loses today. You are arranging what lies in Fortune's control, and abandoning what lies in yours. What are you looking at? To what goal are you straining? The whole future lies in uncertainty; live immediately!" | As expected, RVC elevates the quality of the audio further, leading to a great result. A little bit of noise (“the whole [future]”) gets through, but the conversion is otherwise perfect. |
Overall, this is the most impressive result yet. It’s also noteworthy that, since each sentence comes from its own separate inference, XTTS is not doing anything to “match up” the flow from sentence to sentence to make the transitions seamless. In spite of this, there appears to be consistent continuity between sentences, which is great to see.
Further improvements can take a few directions. Firstly, long sentences have proven difficult for XTTS, and especially so for this trained voice model. The Regular Expressions split currently breaks text on ellipses, full stops, exclamation marks, and question marks. For text with excessively long sentences, I find myself tweaking the text by hand and adding my own punctuation to sentences that run too long. It would be beneficial to improve the text processing further, for example by also splitting on more punctuation such as semicolons, or in extreme cases commas (a rough sketch of this follows below).
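As a sketch of that direction, a fallback splitter could subdivide only sentences over a length threshold, first on semicolons/colons and then on commas. The threshold and helper name below are illustrative, not part of the current script.

```python
import re

MAX_CHARS = 250  # rough per-inference limit discussed above

def split_long_sentence(sentence: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Hypothetical fallback splitter: only subdivide sentences that are too long."""
    if len(sentence) <= max_chars:
        return [sentence]
    # Try semicolons/colons first, then commas if the pieces are still too long.
    for pattern in (r"(?<=[;:])\s+", r"(?<=,)\s+"):
        pieces = [p.strip() for p in re.split(pattern, sentence) if p.strip()]
        if len(pieces) > 1 and all(len(p) <= max_chars for p in pieces):
            return pieces
    return [sentence]  # give up and let XTTS handle it as-is
```

Each resulting piece would then go through the same per-sentence inference loop as above.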
Secondly, I’d like to point out the variety of flow and tone. I think that XTTS does an impressive job of keeping sentences from sounding identical and monotonous, considering it only has one short “reference audio” from which it generates the flow for all of these sentences. For additional variety, however, it is more than likely possible to use multiple reference audios. Prior to generation, XTTS uses the reference audio to compute the speaker’s “conditioning latents” and “speaker embedding”. It is likely technically possible to configure the inference such that two or three reference audios are used to generate two or three sets of speaker latents/embeddings, with a pre-inference step that judges which set is more appropriate for a given sentence. This would require some experimenting, however, and could have detrimental effects on the continuity and flow of the audio if the switching does not sound seamless.
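Continuing the earlier sketches, that idea might look something like the snippet below. The reference clip names and the length-based selection heuristic are placeholders for whatever “judging” step turns out to be appropriate; the conditioning-latent call is the same one used for normal inference.

```python
# Sketch: precompute conditioning latents from several reference clips,
# then pick one set per sentence before inference.

reference_sets = {
    "calm":     ["refs/reference_audio_153.wav"],
    "emphatic": ["refs/reference_audio_042.wav"],  # placeholder clip names
}

latent_sets = {
    name: model.get_conditioning_latents(audio_path=paths)
    for name, paths in reference_sets.items()
}

def pick_latents(sentence: str):
    # Placeholder heuristic: short, punchy sentences get the "emphatic" reference.
    return latent_sets["emphatic" if len(sentence) < 60 else "calm"]

pieces = []
for sentence in sentences:
    gpt_cond_latent, speaker_embedding = pick_latents(sentence)
    out = model.inference(
        text=sentence,
        language="en",
        gpt_cond_latent=gpt_cond_latent,
        speaker_embedding=speaker_embedding,
    )
    pieces.append(np.asarray(out["wav"]))
```

As noted above, the real risk is continuity: if the switch between latent sets is audible at sentence boundaries, it may do more harm than good.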
Of course, it’s also worth revisiting the XTTS training itself, as we did with RVC at the beginning. By paying closer attention to parameters before training (such as audio length), improving the training process, and potentially stitching some audio data together to create longer sentences, we may be able to train a more capable “version 2” of the fine-tuned XTTS model.