Notes and Impressions · November 11, 2025

Notes on Deploying CosyVoice Locally (macOS)


I followed the official instructions step by step.

Download and install:

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
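
A quick way to confirm that the submodules were actually fetched is to check that third_party/Matcha-TTS (the path the workaround script further down adds to sys.path) is not empty. This is only a minimal sketch: the helper name submodule_ready is hypothetical, and it assumes it is run from the root of the CosyVoice checkout.

```python
from pathlib import Path


def submodule_ready(repo_root: str = ".", rel_path: str = "third_party/Matcha-TTS") -> bool:
    """Return True if the submodule directory exists and contains at least one entry."""
    path = Path(repo_root) / rel_path
    # An uninitialised submodule is usually just an empty directory.
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if submodule_ready():
        print("Matcha-TTS submodule looks populated")
    else:
        print("Matcha-TTS is missing or empty; re-run: git submodule update --init --recursive")
```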

Conda environment:

conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

Model download:

Install git lfs:

> brew install git-lfs
> git lfs install
# Download the models with git; make sure git lfs is installed first
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd

The models are large and download slowly, so wait patiently for every clone to finish.
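
If a clone is interrupted, or git lfs was not installed beforehand, the large weight files can be left behind as small Git LFS pointer stubs instead of the real data. The sketch below scans a directory for such stubs by looking for the header line that every LFS pointer file starts with. The function name find_lfs_pointers is hypothetical, and the pretrained_models path is taken from the commands above.

```python
from pathlib import Path

# First line of every Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir: str) -> list[str]:
    """Return the files under model_dir that are still LFS pointer stubs."""
    root = Path(model_dir)
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip git's own metadata and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        with path.open("rb") as fh:
            if fh.read(len(LFS_HEADER)) == LFS_HEADER:
                stubs.append(str(path))
    return stubs


if __name__ == "__main__":
    for stub in find_lfs_pointers("pretrained_models"):
        print("not downloaded yet:", stub)
```

An empty result suggests the weights were fetched; any listed file can usually be retrieved by running git lfs pull inside the corresponding model directory.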

Run the WebUI:

python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct

With a different model, the pretrained voices do not show up in the WebUI.

I used the temporary workaround from the issue tracker:

import sys
import gradio as gr
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio
import torch

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)

def generate_audio(audio_path, tts_text, instruct_text):
    if not audio_path or not tts_text or not instruct_text:
        return None

    prompt_speech = load_wav(audio_path, 16000)

    # Generate the audio, saving each returned segment to its own file
    results = []
    for i, j in enumerate(cosyvoice.inference_instruct2(
        tts_text, 
        instruct_text,
        prompt_speech,
        stream=False
    )):
        output_path = f"output_{i}.wav"
        torchaudio.save(output_path, j['tts_speech'], cosyvoice.sample_rate)
        results.append(output_path)

    if not results:
        return None

    # Concatenate all segments along the time axis
    waveforms = []
    for path in results:
        waveform, sr = torchaudio.load(path)
        waveforms.append(waveform)

    concatenated = torch.cat(waveforms, dim=1)
    output_path = "output_combined.wav"
    torchaudio.save(output_path, concatenated, cosyvoice.sample_rate)

    return output_path

with gr.Blocks(title="CosyVoice TTS") as app:
    gr.Markdown("## CosyVoice 语音合成系统")

    with gr.Row():
        with gr.Column():
            ref_audio = gr.Audio(label="参考音频", type="filepath")
            tts_text = gr.Textbox(label="合成文本", placeholder="输入要合成的文本...")
            instruct_text = gr.Textbox(label="风格指令", placeholder="输入语音风格指令...")
            generate_btn = gr.Button("生成语音", variant="primary")

        with gr.Column():
            audio_output = gr.Audio(label="生成结果", interactive=False)

    generate_btn.click(
        fn=generate_audio,
        inputs=[ref_audio, tts_text, instruct_text],
        outputs=audio_output
    )

if __name__ == "__main__":
    app.launch(server_name="0.0.0.0", server_port=7860, share=False)

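Save the script as web.py and run it:

> python3 web.py --port 50000 --model_dir pretrained_models/CosyVoice2-0.5B

Note that the script as written never reads its command-line arguments: the model directory and port 7860 are hard-coded, so these two flags have no effect and the page is served on port 7860.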

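Then enter <|endofprompt|> in the "风格指令" (style instruction) field, and the CosyVoice2 pretrained model can be used. The field cannot be left blank, because generate_audio returns nothing when any of its three inputs is empty.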
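
For the --port and --model_dir flags in the run command to take effect, web.py would need to parse them, since the script above hard-codes both values. Here is a minimal sketch using the standard argparse module. The parse_args helper and its defaults are assumptions that mirror the hard-coded values, not part of the original workaround.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags used in the run command (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="CosyVoice2 Gradio demo")
    parser.add_argument("--port", type=int, default=7860,
                        help="port for the Gradio server")
    parser.add_argument("--model_dir", default="pretrained_models/CosyVoice2-0.5B",
                        help="local path to the pretrained model")
    return parser.parse_args(argv)


# In web.py the parsed values would replace the hard-coded ones, for example:
#   args = parse_args()
#   cosyvoice = CosyVoice2(args.model_dir, load_jit=False, load_trt=False, fp16=False)
#   app.launch(server_name="0.0.0.0", server_port=args.port, share=False)
```

With this change, the command shown above would serve the page on port 50000 rather than 7860.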
