Fish Speech released version 1.5 at the beginning of December, a significant upgrade over version 1.4. The main improvements are as follows:
Multilingual and Cross-language Support
Version 1.5 expands multilingual support, including English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish. Users can paste mixed-language text directly into the input box, and the system recognizes and processes it automatically.
Phoneme Independence
This model has strong generalization capabilities and does not rely on phonemes for text-to-speech conversion. It can handle text from any language script, ensuring higher flexibility and applicability.
High Accuracy
When processing a 5-minute English text, the model maintains a character error rate (CER) and word error rate (WER) of about 2%, ensuring high accuracy.
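For readers unfamiliar with these metrics, the following is a minimal sketch of how CER and WER are commonly computed: the edit distance between a reference text and a transcript, divided by the reference length. The source does not describe how its figures were measured; in TTS evaluation the transcript usually comes from running a speech recognizer on the synthesized audio. The function names and the sample sentences here are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the transcript mishears "leave" as "live".
reference = "in the winter i do not leave london"
hypothesis = "in the winter i do not live london"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

A single misheard word in a short sentence already produces a WER well above 2%, which gives a sense of how few errors a 2% rate allows over five minutes of speech.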
Fully End-to-End
This model integrates automatic speech recognition (ASR) and text-to-speech (TTS) functions, requiring no additional plugins or models. It achieves true end-to-end processing, unlike the traditional three-stage (ASR + LLM + TTS) workflow.
Emotional Expression
This model can generate speech with strong emotions, making the synthesized speech more natural and vivid.
Current Status
Open-source TTS has been improving, but problems remain. On long texts the generated speech is still not good enough, and some complex texts are not handled at all. Special cases such as numbers and dates are also read poorly. Overall, synthesis quality still falls short of closed-source TTS.
In the author's past testing of open-source TTS, Alibaba Cloud's CosyVoice and Shanghai Jiao Tong University's F5 TTS were among the better-performing models. CosyVoice is used for comparison here.
CosyVoice model version: CosyVoice-300M
Actual Testing
Test GPU: RTX 4060Ti 16G
First, here is the test text:
Chinese Long Text, Length: 456 Chinese Characters
观音听了这话,立刻和木吒撞到了南天门里。早有丘、张两位天师上前询问:“你要去哪里?”菩萨回答:“我想见玉帝一面。”两位天师立刻向宫中禀报,玉帝于是降下龙颜迎接她。菩萨行礼完毕说:“多谢佛祖旨意,我要去东土寻找取经人,途中遇到孽龙,被他吊在半空中,幸亏我请求饶命,佛祖赐我代龙驮载取经人。”五位天帝听后立即传旨赦免那天将,并让他将小龙释放,交给菩萨。菩萨谢恩后离开。那小龙磕头感谢活命之恩,并听从菩萨安排。菩萨将其放入深不见底的峡谷中,等取经人到来时,他便化作一匹白马,助取经人西行立功。小龙领命,潜身不见。
菩萨带着木吒徒弟越过这座山,继续向东土进发。没走多久,忽然看见万道金光和千条瑞气。木吒说道:“师父,那发光的地方应该是五行山了,上面有如来的‘压帖’。”菩萨说:“这是那搅乱蟠桃会、大闹天宫的齐天大圣如今被压在这里。”木吒道:“正是,正是。”师徒二人一起上山去看,发现“唵嘛呢叭[口迷]吽”六字真言上有着如来的“压帖”。菩萨看完感叹不已,并作诗一首。
师徒们正说话时,被大圣听见了。大圣在山根下游泳,高声喊道:
“原来是那个在山上吟诗揭我的短?”菩萨听了便立刻下山来找他。只见土地神、山神以及监管大圣的天将都前来迎接菩萨,并带她来到大圣面前。大圣被压在石匣之下,只能用眼睛说话,但不能动弹。菩萨问道:“孙大圣,你认得我吗?”大圣用火眼金睛点着头,高声喊道:“我怎么会不认识你?你是从南海普陀落伽山救苦救难的大慈大悲观世音菩萨。承蒙你来看我一眼,承蒙你来看我一眼!我现在这里度日如年,没有一个人来看我,你从哪里来呢?”菩萨说:“我奉佛旨东土寻取经人路过这里,特意留下我的残余法力来帮助你。”大圣说:“如来骗了我,把我在这里压了五百多年,我现在动弹不得。我希望能得到菩萨的方便,救我一救!”菩萨说:“你这东西罪孽深重,如果救你出来恐怕你又要造孽。反而不是好事。”大圣说:“我已经后悔了,只愿菩萨开恩,给我一条生路。”这真是:罪人终有报,今日得见菩萨。
English Long Text
Hercule Poirot shivered. The thought of the Christmas countryside at this season of the year did not attract him.
"A good old-fashioned Christmas!" Mr Jesmond stressed it.
"Me - I am not an Englishman," said Hercule Poirot. "In my country, Christmas, it is for the children. The New Year, that is what we celebrate.""Ah," said Mr Jesmond, "but Christmas in England is a great institution and I assure you at Kings Lacey you would see it at its best. It's a wonderful old house, you know. Why, one wing of it dates from the fourteenth century."
Again Poirot shivered. The thought of a fourteenth-century English manor house filled him with apprehension. He had suffered too often in the historic country houses of England. He looked around appreciatively at his comfortable modern flat with its radiators and the latest patent devices for excluding any kind of draught.
"In the winter," he said firmly, "I do not leave London.""I don't think you quite appreciate, Mr Poirot, what a very serious matter this is." Mr Jesmond glanced at his companion and then back at Poirot.
Poirot's second visitor had up to now said nothing but a polite and formal "How do you do." He sat now, gazing down at his well-polished shoes, with an air of the utmost dejection on his coffee-coloured face. He was a young man, not more than twenty-three, and he was clearly in a state of complete misery.
"Yes, yes," said Hercule Poirot. "Of course the matter is serious. I do appreciate that. His Highness has my heartfelt sympathy."
To avoid repetition, these two passages are referred to below as the Chinese Long Text and the English Long Text.
Inference Speed
The measured generation times are as follows:
| Model | Text Type | Generation Time | Audio Length |
|---|---|---|---|
| Fish Speech | Chinese Long Text | 23 seconds | 2 minutes 03 seconds |
| Fish Speech | English Long Text | 24 seconds | 1 minute 47 seconds |
| CosyVoice | Chinese Long Text | 55 seconds | 3 minutes 07 seconds |
| CosyVoice | English Long Text | 35 seconds | 1 minute 50 seconds |
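Because the two models produce audio of different lengths for the same text, raw generation time is easier to compare once it is normalized. One common way to do this is the real-time factor (RTF): generation time divided by audio duration, where values below 1 mean faster than real time. The sketch below applies that to the figures in the table; the function name `real_time_factor` is chosen here for illustration.

```python
# (model, text type, generation time in seconds, audio length in seconds),
# copied from the table above.
measurements = [
    ("Fish Speech", "Chinese Long Text", 23, 2 * 60 + 3),
    ("Fish Speech", "English Long Text", 24, 1 * 60 + 47),
    ("CosyVoice", "Chinese Long Text", 55, 3 * 60 + 7),
    ("CosyVoice", "English Long Text", 35, 1 * 60 + 50),
]


def real_time_factor(generation_s, audio_s):
    """Seconds of compute spent per second of generated audio."""
    return generation_s / audio_s


for model, text_type, generation_s, audio_s in measurements:
    rtf = real_time_factor(generation_s, audio_s)
    print(f"{model:<12} {text_type:<18} RTF = {rtf:.2f}")
```

By this measure Fish Speech comes out at roughly 0.19 (Chinese) and 0.22 (English), and CosyVoice at roughly 0.29 and 0.32. Since the two models read the Chinese text at different speeds, RTF and raw generation time are best read together.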
Default Effect
Under default settings, the speech generated by Fish Speech is as follows:
Chinese Long Text:
English Long Text:
The speech generated by CosyVoice is as follows:
Chinese Long Text:
English Long Text:
- By default, Fish Speech reads Chinese noticeably faster, while CosyVoice's Chinese pace sounds much more natural.
- The English speaking speed is roughly the same for both, and both generate excellent quality.
- In emotional expression, the default results are similar: good but not outstanding.
- In sentence phrasing, Fish Speech performs worse than CosyVoice, though this may be related to its faster default speed.
Multilingual, Numbers, and Date Reading Comparison
Test Text:
在一个晴朗的早晨,2023年10月15日,李华和他的朋友 Tom 决定去公园 Cosplay。他们计划在公园里度过一个轻松的周末。
The speech generated by CosyVoice is as follows:
The speech generated by Fish Speech is as follows:
Conclusion
- Fish Speech can correctly read multilingual, numerical, and date content.
- CosyVoice performs poorly on mixed Chinese and English text, but reads the date correctly.
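To make "reading a date correctly" concrete, the sketch below expands a numeric Chinese date into the form it is normally spoken: the year digit by digit, the month and day as quantities. This only illustrates the expected reading and is the kind of preprocessing one might apply before a model that misreads dates. It is not a description of how Fish Speech or CosyVoice handle dates internally, and all names in it are hypothetical.

```python
import re

DIGITS = "零一二三四五六七八九"


def read_digits(number):
    """Read a number digit by digit, as years are spoken: 2023 -> 二零二三."""
    return "".join(DIGITS[int(d)] for d in number)


def read_small_number(number):
    """Read an integer below 100 as a quantity, as months and days are spoken: 15 -> 十五."""
    tens, ones = divmod(int(number), 10)
    if tens == 0:
        return DIGITS[ones]
    prefix = "十" if tens == 1 else DIGITS[tens] + "十"
    return prefix + (DIGITS[ones] if ones else "")


# Matches dates such as 2023年10月15日, with optional spaces around the numbers.
DATE = re.compile(r"(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日")


def normalize_dates(text):
    """Replace every numeric date in the text with its spoken form."""
    def spoken(match):
        year, month, day = match.groups()
        return (f"{read_digits(year)}年"
                f"{read_small_number(month)}月"
                f"{read_small_number(day)}日")
    return DATE.sub(spoken, text)


print(normalize_dates("在一个晴朗的早晨,2023年10月15日,李华和他的朋友 Tom 决定去公园 Cosplay。"))
```

Text outside the date, including the English words, passes through unchanged, so the mixed-language handling is still left entirely to the model.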
Zero-Shot Replication
Test Text:
突然,李华的手机响了,是一条来自公司的紧急邮件。邮件中提到,一个重要的项目需要在 2023 年 6 月 1 日前完成。
Test sample audio:
Test Results:
Fish Speech:
CosyVoice:
Conclusion: Both reproduce the timbre of the sample voice well. CosyVoice performs worse than Fish Speech in sentence phrasing.
Final Thoughts
The biggest advantages of Fish Speech are its generation speed and its memory usage. Its memory usage is very low, which comes not only from the smaller model size but also from its end-to-end design.
Existing Issues
- Poor Pause Handling: Both Fish Speech and CosyVoice-300M handle pauses poorly when generating long texts. Fish Speech's faster default speaking speed makes the problem worse and is tiring to listen to.
- Insufficient Emotional Control: Fish Speech currently provides no emotional control options. The emotion it produces on its own is acceptable but not outstanding. In contrast, CosyVoice offers a specially fine-tuned emotional version (CosyVoice-Instruct), which performs better in this regard.
- Zero-Shot Replication Issues: Both Fish Speech and CosyVoice lose the emotional expression of the original text during zero-shot replication, possibly because the sample audio strongly influences the output.
Since Fish Speech 1.5 was only just released, more detailed testing is still ongoing.
Compared to Version 1.4
Version 1.4 of Fish Speech suffered from serious hallucination and frequently dropped words. Version 1.5 shows a significant improvement in generalization, as the official release claims.