https://llasatts.github.io/llasatts/

Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Abstract. Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both train-time and inference-time compute. However, current state-of-the-art TTS systems that leverage LLMs are often multi-stage and require separate models (e.g., a diffusion model after the LLM), which complicates the decision of which model to scale during training or testing. This work makes the following contributions. First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose LLaSA, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA. Our experiments reveal that scaling train-time compute for LLaSA consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search and find that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have publicly released the checkpoints and training code for our TTS models (1B, 3B, 8B) and our codec model.

Contents
* Scaling Train-Time Compute
* Scaling Inference-Time Compute
* Codec Reconstruction Samples

Comparison

Inference-time scaling results using a different evaluation metric

The left figure uses a different speaker embedding model, speechbrain/spkrec-ecapa-voxceleb, as a reference evaluation metric for speaker similarity. The right figure is the original Fig. 2.
[Image 1] [Image 2]

Comparison Results on Ravdess Benchmark

Ravdess has only two texts: "Dogs are sitting by the door." as the prompt text, and "Kids are talking by the door." as the synthesis text. The following results for NaturalSpeech 3, NaturalSpeech 2, Voicebox (R), VALL-E (R), Mega-TTS 2, StyleTTS 2, and HierSpeech++ are taken from the official NaturalSpeech 3 demo page. (R) indicates systems reproduced by the NaturalSpeech 3 authors.

Columns: Prompt Emotion | Prompt | Ground Truth | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k | FireRedTTS | F5-TTS | MaskGCT | E2-TTS | CosyVoice2 | CosyVoice | NaturalSpeech 3 | NaturalSpeech 2 | Voicebox (R) | VALL-E (R) | Mega-TTS 2 | StyleTTS 2 | HierSpeech++

neutral: [audio samples, one per column]
happy: [audio samples, one per column]
calm: [audio samples, one per column]
sad: [audio samples, one per column]
angry: [audio samples, one per column]
fearful: [audio samples, one per column]
disgust: [audio samples, one per column]
surprised: [audio samples, one per column]

Scaling Train-Time Compute

We randomly selected two samples from the English test set. All synthesized audio was generated solely from the input text (without any speech prompt), and each model was sampled three times at random to specifically evaluate its text comprehension ability. The table below presents the results across models of various sizes and training data amounts.

Columns: Sample | Llasa-1b-80k | Llasa-1b-160k | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k

Text: "Uh, are you sure about this?" Tim asked nervously, looking at the steep slope before them. "Whoa, it's higher than I thought," he continued, his voice filled with trepidation. "Aha, but look at the view," Emily responded with excitement, "it's worth the climb!"
Random Sample 1: [audio samples, one per model]
Random Sample 2: [audio samples, one per model]
Random Sample 3: [audio samples, one per model]

Text: Her hands shaking with excitement, Alice Monroe stuttered, "oh..I-I can't believe it! Is this really my acceptance letter to Harvard?" Marco cannot believe it either: "God damn it! How did you pull this off?"
Random Sample 1: [audio samples, one per model]
Random Sample 2: [audio samples, one per model]
Random Sample 3: [audio samples, one per model]

We also randomly selected two samples from the Chinese test set.

Columns: Sample | Llasa-1b-80k | Llasa-1b-160k | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k

Text: Lian Wai Yu Chan Chan, Chun Yi Lan Shan. Luo Qin Bu Nai Wu Geng Han. Meng Li Bu Zhi Shen Shi Ke, Yi Shang Tan Huan. Du Zi Mo Ping Lan, Wu Xian Jiang Shan. Bie Shi Rong Yi Jian Shi Nan. Liu Shui Luo Hua Chun Qu Ye, Tian Shang Ren Jian.
Random Sample 1: [audio samples, one per model]
Random Sample 2: [audio samples, one per model]
Random Sample 3: [audio samples, one per model]

Text: Ren Yao Shi Xing, Gan Yi Xing Xing Yi Xing, Yi Xing Xing Xing Xing Xing, Xing Xing Xing Gan Na Xing Du Xing, Yao Shi Bu Xing, Gan Yi Xing Bu Xing Yi Xing, Yi Xing Bu Xing Xing Xing Bu Xing, Xing Xing Bu Xing, Gan Na Xing Du Bu Xing.
Random Sample 1: [audio samples, one per model]
Random Sample 2: [audio samples, one per model]
Random Sample 3: [audio samples, one per model]

Scaling Inference-Time Compute

Using the Llasa-1b-250k model, we compared the results of direct inference and inference-time scaling.
The two examples shown below were randomly selected from the seed-tts-eval test-hard set.

Columns: Target Text | Prompt | Direct Inference | Scaling Inference-Time Compute

Target Text: La Ma Yu Ya Ba Da Nan Bian Lai Liao Ge Ya Ba, Yao Li Bie Liao Ge La Ba; Da Bei Bian Lai Liao Ge La Ma, Shou Li Ti Liao Ge Ta Ma. Ti Zhao Ta Ma De La Ma Yao Na Ta Ma Huan Bie Zhao La Ba De Ya Ba De La Ba; Bie Zhao La Ba De Ya Ba Bu Yuan Na La Ba Huan Ti Zhao Ta Ma De La Ma De Ta Ma. Bu Zhi Shi Bie Zhao La Ba De Ya Ba Da Liao Ti Zhao Ta Ma De La Ma Yi La Ba; Huan Shi Ti Zhao Ta Ma De La Ma Da Liao Bie Zhao La Ba De Ya Ba Yi Ta Ma. La Ma Hui Jia Tun Ta Ma; Ya Ba Di Di Da Da Chui La Ba
Prompt | Direct Inference | Scaling Inference-Time Compute: [audio samples]

Target Text: Gao Gao Shan Shang Yi Zuo Miao, Zhu Liao Ba Ge Chu Jia Ren, Ba Ge Dao Ren Du You Ming: Da Di Zi, Jiao Deng Da, Er Di Zi, Jiao Da Deng, San Di Zi, Jiao Hou San, Si Di Zi, Jiao San Hou, Wu Di Zi, Jiao Ping Cha, Liu Di Zi, Jiao Cha Ping, Qi Di Zi, Jiao Bing Bie Bian, Ba Di Zi, Jiao Bian Bie Bing. Deng Da Hui Da Gu, Da Deng Hui Zhuang Zhong, Hou San Hui Shao Huo, San Hou Hui Dian Deng; Ping Cha Hui Chui Guan, Cha Ping Hui Chui Sheng; Bing Bie Bian Hui Zhu Fan, Bian Bie Bing Hui Nian Jing. Da Deng Yao Da Deng Da Gu, Deng Da Yao Zhuang Da Deng Zhong; San Hou Yao Shao Hou San Huo, Hou San Yao Dian San Hou Deng; Cha Ping Yao Chui Ping Cha Guan, Ping Cha Yao Chui Cha Ping Sheng; Bian Bie Bing Yao Zhu Bing Bie Bian De Fan, Bing Bie Bian Yao Nian Bian Bie Bing De Jing. Da Deng Da Bu Hao Deng Da De Gu, Deng Da Zhuang Bu Hao Da Deng De Zhong; San Hou Shao Bu Hao Hou San De Huo, Hou San Dian Bu Hao San Hou De Deng; Cha Ping Chui Bu Hao Ping Cha De Guan, Ping Cha Chui Bu Hao Cha Ping De Sheng; Bian Bie Bing Zhu Bu Hao Bing Bie Bian De Fan, Bing Bie Bian Nian Bu Hao Bian Bie Bing De Jing. Deng Da Huan Da Deng Da Gu, Da Deng Huan Zhuang Da Deng Zhong; Hou San Huan Shao Hou San Huo, San Hou Huan Dian San Hou Deng; Ping Cha Huan Chui Ping Cha Guan, Cha Ping Huan Chui Cha Ping Sheng; Bing Bie Bian Huan Zhu Bing Bie Bian De Fan, Bian Bie Bing Huan Nian Bian Bie Bing De Jing. Ge Ren Huan Gan Ge Yi Xing, Bai Bai Zheng Ge Lian Hong Bo Zi Qing.
Prompt | Direct Inference | Scaling Inference-Time Compute: [audio samples]

We use Llasa-1b-250k for a continuation experiment on the LibriSpeech test-clean dataset. The generated audio for each sample starts with the first 3 seconds of the ground-truth audio, followed by the model's generated continuation.

Columns: Ground Truth | Direct Inference | Scaling Inference-Time Compute

Five rows, each with one audio sample per column: [audio samples]
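The comparison in this section between direct inference and inference-time scaling can be illustrated with a small sketch. The page does not spell out the search procedure, so the code below uses plain best-of-N selection as a simple stand-in: draw several candidates, score each with a verifier, and keep the highest-scoring one. Everything here is an assumption made for illustration: the names `best_of_n` and `cosine_similarity` are hypothetical, the "candidates" are toy 2-D vectors standing in for speaker embeddings of synthesized audio, and the verifier is cosine similarity to a reference embedding (the kind of speaker-similarity score a model such as speechbrain/spkrec-ecapa-voxceleb could supply). It is not the released Llasa code.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_of_n(
    generate: Callable[[], List[float]],
    score: Callable[[List[float]], float],
    n: int,
) -> Tuple[List[float], float]:
    """Draw n candidates and return the one the verifier scores highest.

    `generate` and `score` are hypothetical callables: in a real system,
    `generate` would sample one utterance from the TTS model and `score`
    would be a speech understanding model acting as a verifier.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    best_candidate = generate()
    best_score = score(best_candidate)
    for _ in range(n - 1):
        candidate = generate()
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_candidate, best_score = candidate, candidate_score
    return best_candidate, best_score


if __name__ == "__main__":
    rng = random.Random(0)
    reference = [1.0, 0.0]  # toy stand-in for the prompt speaker's embedding

    def sample_embedding() -> List[float]:
        # Toy stand-in for "synthesize once, then embed the audio".
        return [rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)]

    def verifier(candidate: List[float]) -> float:
        return cosine_similarity(candidate, reference)

    for n in (1, 4, 16):
        _, s = best_of_n(sample_embedding, verifier, n)
        print(f"best-of-{n} verifier score: {s:.3f}")
```

With n = 1 this reduces to direct inference. Larger n spends more inference-time compute and biases the selected output toward whatever the chosen verifier prefers, which matches the abstract's observation that sampling modes shift toward the preferences of specific verifiers.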
Codec Reconstruction Samples

Columns: Sample | GT | Xcodec2 | StableCodec | WavTokenizer_40tps | WavTokenizer_75tps | Xcodec_nq1 | Xcodec_nq2 | BigCodec | DAC_16k_nq1 | DAC_16k_nq2 | DAC_16k_nq12 | Encodec_nq2 | Encodec_nq8 | Mimi_nq4 | Mimi_nq6 | Mimi_nq8 | SemanticCodec | SpeechTokenizer_nq1 | SpeechTokenizer_nq2

Sample 1: [audio samples, one per column]
Sample 2: [audio samples, one per column]
Sample 3: [audio samples, one per column]
Sample 4: [audio samples, one per column]