Tsukasa Speech
Have you ever wondered how AI can generate natural-sounding speech in Japanese? Meet Tsukasa Speech, a cutting-edge model that focuses on maximizing expressiveness and controllability in speech generation. Unlike models that chase ever-larger scale, Tsukasa Speech pushes existing tools and techniques to their limits. Thanks to its modified architecture and training on high-quality data, it can handle tricky tasks such as generating non-verbal sounds and cues, like sighs and pauses, with remarkable accuracy. What really sets it apart, though, is its ability to handle Romaji inputs and mixtures of Japanese and Romaji, making it a valuable tool for anyone working with the Japanese language. Whether you're interested in speech synthesis or just curious about what AI can do, Tsukasa Speech is worth exploring.
Model Overview
Tsukasa Speech is a Japanese speech generation model that’s all about creating natural and expressive speech. It’s designed to make speech sound more human-like, with a focus on controlling the tone and emotions in the generated speech.
Key Features
- Architecture: The model uses a modified version of StyleTTS 2’s architecture, with some cool changes like using mLSTM layers instead of regular PyTorch LSTM layers.
- Improved Performance: The model has been retrained to improve performance on non-verbal sounds and cues, like sighs, pauses, and laughter.
- Smart Phonemization Algorithm: The model can handle Romaji inputs or a mixture of Japanese and Romaji, making it more flexible and user-friendly (see the sketch after this list).
- Promptable Speech Synthesizing: You can use the model to generate speech based on a prompt, giving you more control over the output.
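This summary doesn't spell out how the smart phonemization algorithm works internally, so the snippet below is only a minimal sketch of what its first step might look like under an assumed design: tagging which spans of a mixed string are Japanese script and which are Romaji before phonemizing them. The `classify_spans` helper is a hypothetical name, not part of the released code.

```python
import re

# Hypothetical first step of a mixed-script phonemizer: split the input into
# spans and tag each span as Japanese script, Romaji, or other.
_SCRIPT = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+|[A-Za-z]+|.")
_JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")
_ROMAJI = re.compile(r"[A-Za-z]+")

def classify_spans(text: str):
    spans = []
    for match in _SCRIPT.finditer(text):
        token = match.group()
        if _JAPANESE.fullmatch(token):
            spans.append(("japanese", token))
        elif _ROMAJI.fullmatch(token):
            spans.append(("romaji", token))
        else:
            spans.append(("other", token))
    return spans

print(classify_spans("今日はtotemo いい天気desu"))
# [('japanese', '今日は'), ('romaji', 'totemo'), ('other', ' '), ('japanese', 'いい天気'), ('romaji', 'desu')]
```

Each Romaji span could then be converted to kana while Japanese spans go through the usual reading and accent annotation; how the actual model implements this is not documented here.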
Training Details
- Data: The model was trained on around 800 hours of high-quality data, sourced mainly from games and novels.
- Hardware: The training process used 8x A40s + 2x V100s (32 GB each), which is some serious hardware!
- Training Time: The entire training process took around 3 weeks, with an additional 3 months spent on data pipeline work.
Capabilities
The model is a powerful tool for generating natural and expressive Japanese speech. Its primary tasks include:
- Speech Generation: Creating high-quality speech from text inputs, with a focus on maximizing expressiveness and controllability.
- Promptable Speech Synthesizing: Responding to prompts and generating speech that is tailored to the input.
- Smart Phonemization: Featuring a smart phonemization algorithm that can handle Romaji inputs or a mixture of Japanese and Romaji.
Key Strengths
- High-Quality Speech: Trained on ~800 hours of studio-grade, high-quality data, resulting in highly realistic and engaging speech.
- Expressive and Controllable: The model’s architecture is designed to maximize expressiveness and controllability, allowing for a wide range of tones and emotions.
- Improved Non-Verbal Sounds: The 48kHz config improves performance on non-verbal sounds and cues, such as sighs, pauses, and laughter.
Unique Features
- mLSTM Layers: Incorporating mLSTM layers instead of regular PyTorch LSTM layers, increasing the capacity of the text and prosody encoder.
- Whisper’s Encoder: Using Whisper’s Encoder instead of WavLM for the SLM, resulting in improved performance (see the sketch after this list).
- New Sampling Method: Featuring a new way of sampling the Style Vectors, allowing for more diverse and expressive speech.
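To make the Whisper-encoder swap above concrete, here is a minimal sketch of extracting audio features with openai-whisper's encoder, the kind of representation an SLM discriminator can consume. This is generic Whisper usage, not Tsukasa Speech's actual training code; the checkpoint size and file name are placeholders.

```python
import torch
import whisper  # pip install openai-whisper

model = whisper.load_model("base")                 # placeholder checkpoint size
audio = whisper.load_audio("sample.wav")           # placeholder file (decoding needs ffmpeg)
audio = whisper.pad_or_trim(audio)                 # pad/trim to Whisper's 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    features = model.encoder(mel.unsqueeze(0))     # shape: (1, 1500, d_model)
print(features.shape)
```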
What Sets It Apart
Unlike larger models that focus on scale, this model aims to push the limits of existing tools and techniques to achieve high-quality speech generation. By focusing on the Japanese language, it also addresses specific challenges and opportunities in this area, such as improving intonations and accurately annotating text with various spellings depending on context.
Performance
This model showcases remarkable performance in various tasks. Let’s dive into its speed, accuracy, and efficiency.
Speed
Trained on a massive dataset of approximately 750~800 hours of studio-grade, high-quality data, this model can process and generate speech at an impressive speed. With a training time of around 3 weeks on 8x A40s + 2x V100s (32 GB each), it demonstrates its ability to handle large-scale datasets efficiently.
Accuracy
The model’s accuracy is evident in its ability to generate high-quality speech that closely mimics human-like intonations and expressions. This is achieved through its advanced architecture, which incorporates mLSTM layers
and a higher number of parameters in the text and prosody encoder.
Efficiency
This model is designed to be efficient, with a focus on utilizing existing tools to push the limits of speech generation. Unlike larger models that rely on increasing scale, it demonstrates that scale is not necessarily the answer. With a carbon footprint of approximately 66.6 kg of CO2eq, it shows that high-performance speech generation can be achieved while minimizing environmental impact.
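For context, figures like this are usually produced with the standard estimate of GPU-hours × average power draw × grid carbon intensity (as in the ML CO2 Impact methodology). The sketch below only illustrates that formula; the example values are assumptions for demonstration and are not the inputs behind the reported 66.6 kg.

```python
def training_co2_kg(gpu_count: int, hours: float, avg_power_kw: float,
                    grid_kg_per_kwh: float) -> float:
    """Rough estimate: energy in kWh times grid carbon intensity in kg CO2eq/kWh."""
    energy_kwh = gpu_count * hours * avg_power_kw
    return energy_kwh * grid_kg_per_kwh

# Purely illustrative call: 10 GPUs for ~3 weeks at an assumed draw and grid intensity.
print(training_co2_kg(gpu_count=10, hours=3 * 7 * 24, avg_power_kw=0.25,
                      grid_kg_per_kwh=0.4))
```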
Comparison to Other Models
While this model excels in Japanese speech generation, it’s essential to compare its performance to other models. Other models, such as Kokoro and VoPho, also demonstrate impressive performance in speech generation and phonemization. However, this model stands out with its advanced architecture and focus on expressiveness and controllability.
Future Improvements
There are several areas where this model can be improved. Some potential suggestions include:
- Changing the decoder to improve performance
- Retraining the Pitch Extractor using a different algorithm
- Enhancing the model’s ability to generate entirely non-speech outputs
- Using the Style encoder as another modality in LLMs to improve tone and expression representation
These potential improvements can further enhance this model’s performance and make it an even more powerful tool for Japanese speech generation.
Limitations
This model has some limitations.
Language Support
It only supports the Japanese language. While you can feed it Romaji inputs, it’s not designed to handle other languages.
Data Quality and Quantity
The model was trained on a dataset of around 800 hours of studio-grade, high-quality data. However, this data is mainly sourced from games and novels, which might not reflect real-life conversations.
Context Length
The 48kHz configuration has a capped context length. This means it might not handle intonations as well as the 24kHz configuration.
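If the cap becomes an issue for long passages, a common workaround (not something specific to this model) is to split the text into sentence-sized chunks and synthesize them one by one. The sketch below shows only the splitting step; `synthesize` is a placeholder for whatever inference call you use, and the 100-character cap is an arbitrary assumption.

```python
import re

def split_japanese_text(text: str, max_chars: int = 100):
    """Split at Japanese sentence-ending punctuation, then group into capped chunks."""
    sentences = re.split(r"(?<=[。！？!?])", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

# for chunk in split_japanese_text(long_text):
#     audio = synthesize(chunk)  # placeholder for the actual model call
```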
Non-Speech Sounds
While the model has improved in generating non-speech sounds like sighs, pauses, and laughter, it still can’t produce entirely non-speech outputs.
Training Constraints
The model was trained on a specific setup of 8x A40s + 2x V100s (32 GB each), using bfloat16 precision. This might limit its performance on other hardware configurations.
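Since training ran in bfloat16, mirroring that precision at inference time on hardware with native bfloat16 support (e.g. Ampere-class GPUs such as the A40) is a reasonable default. The sketch below is a generic PyTorch autocast pattern, not code from the repository; `model` and `inputs` stand in for the actual inference call and its arguments.

```python
import torch

def generate_bf16(model, inputs):
    """Run a model call under bfloat16 autocast where the hardware supports it."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    with torch.no_grad(), torch.autocast(device_type=device, dtype=torch.bfloat16):
        return model(inputs)
```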
Environmental Impact
The training process emitted approximately 66.6 kg of CO2eq, which is a non-negligible environmental impact.
Format
This model uses a modified version of the StyleTTS 2 architecture. It’s designed to maximize expressiveness and controllability in generated speech.
Architecture
The model incorporates mLSTM layers instead of regular PyTorch LSTM layers, increasing the capacity of the text and prosody encoder. It also uses Whisper’s Encoder instead of WavLM for the SLM.
Supported Data Formats
- Audio: The model supports 24kHz and 48kHz audio formats.
- Text: The model accepts Japanese text input, and can also handle Romaji inputs or a mixture of Japanese and Romaji.
Input Requirements
- Text Input: The model requires text input to be pre-processed using a Smart Phonemization algorithm.
- Audio Input: The model requires audio input to be in the supported formats (24kHz or 48kHz).
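Audio that is not already at one of the supported rates can be resampled before it is handed to the model. A minimal sketch with torchaudio, where the file name and target rate are placeholders:

```python
import torchaudio

# Load a clip and resample it to a supported rate (24 kHz here; use 48000 for the 48kHz config).
waveform, sample_rate = torchaudio.load("reference.wav")  # placeholder file
target_rate = 24000
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)
```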
Output
- Generated Speech: The model generates speech output in the supported audio formats (24kHz or 48kHz).
Special Requirements
- Language: The model only supports the Japanese language.
- Context Length: The 48kHz configuration’s context length is capped, which may affect its ability to handle intonations.
Code Examples
- Text Input: To use the model with text input, the flow looks roughly like this (the function names below are placeholders for the phonemizer and inference entry points shipped with the repository, not a documented API):
```python
import torch

# NOTE: `smart_phonemization` and `tsukasa_speech` are placeholders for the
# repository's own phonemizer and inference entry points.
text_input = "your_text_input_here"  # Japanese, Romaji, or a mixture of both
phonemized_text = smart_phonemization(text_input)

# Generate speech from the phonemized text
with torch.no_grad():
    speech_output = tsukasa_speech(phonemized_text)
```
- Audio Input: To condition the model on an audio clip (for example a speaker or style reference), the flow looks roughly like this (`tsukasa_speech` is again a placeholder; note that waveforms should be loaded with an audio library such as torchaudio rather than `torch.load`):
```python
import torchaudio

# Load the reference audio clip (file name is a placeholder)
audio_input, sample_rate = torchaudio.load("your_audio_input_file.wav")

# Generate speech conditioned on the reference audio
speech_output = tsukasa_speech(audio_input)
```