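Bark Text-to-Speech¶

Why Learn Bark¶

Bark is an open-source text-to-speech model developed by Suno AI. It generates natural-sounding multilingual speech and can also produce non-speech sounds such as laughter, sighs, and music. Unlike conventional TTS systems, Bark is built on the Transformer architecture, and the speech it generates carries varied emotion and prosody. For podcasts, audiobooks, AI assistant voices, video voice-overs, and similar uses, Bark offers free and highly expressive speech synthesis.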

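Core Concepts¶

Concept	Plain-language meaning	Role
Voice Preset	A ready-made speaker voice	Selects a predefined speaking style and timbre
Semantic Tokens	Tokens that encode what is said	Capture semantic and prosodic information
Coarse Tokens	Coarse-grained acoustic tokens	Produce the basic audio features
Fine Tokens	Fine-grained acoustic tokens	Refine the audio quality
EnCodec	Audio codec	Meta's neural audio codec
Non-speech	Non-speech sounds	Laughter, music, pauses, and so on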
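
To make the three token stages concrete, the sketch below estimates how many tokens each stage produces for a clip of a given length. The rates and codebook counts are assumptions: they mirror constants that appear to be defined in bark/generation.py (49.9 Hz semantic tokens, 75 Hz acoustic frames, 2 coarse codebooks, 8 codebooks in total, 24 kHz output) and are re-declared here, so check them against the installed version. `TokenBudget` and `token_budget` are illustrative names, not part of Bark.

```python
from dataclasses import dataclass

# Assumed constants; re-declared here so the sketch runs without Bark installed.
SEMANTIC_RATE_HZ = 49.9
COARSE_RATE_HZ = 75
N_COARSE_CODEBOOKS = 2
N_FINE_CODEBOOKS = 8
SAMPLE_RATE = 24_000


@dataclass
class TokenBudget:
    semantic: int
    coarse: int
    fine: int
    samples: int


def token_budget(seconds: float) -> TokenBudget:
    """Estimate per-stage token counts for `seconds` of audio."""
    frames = round(seconds * COARSE_RATE_HZ)  # acoustic frames per codebook
    return TokenBudget(
        semantic=round(seconds * SEMANTIC_RATE_HZ),
        coarse=frames * N_COARSE_CODEBOOKS,
        fine=frames * N_FINE_CODEBOOKS,
        samples=round(seconds * SAMPLE_RATE),
    )


print(token_budget(10.0))
```

The counts grow at each stage, which is one way to see why the later stages dominate generation time for longer clips.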

Installation and Configuration¶

Installation¶

pip install git+https://github.com/suno-ai/bark.git

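# Do not run "pip install bark": that PyPI name belongs to an unrelated package
# that Suno does not maintain.

# Environment variables (set them before importing bark)
export SUNO_USE_SMALL_MODELS=True    # use the small models (saves VRAM)
export SUNO_OFFLOAD_CPU=True          # offload models to the CPU when VRAM is short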
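
The same switches can be set from Python, which helps in notebooks where a shell `export` is inconvenient. This is a minimal sketch that assumes Bark reads the variables when the package is first imported, so the helper has to run before `import bark`. `configure_bark_env` is a hypothetical helper, not a Bark function.

```python
import os


def configure_bark_env(small_models: bool = True, offload_cpu: bool = False) -> dict:
    """Set Bark's environment switches; call this before `import bark`."""
    settings = {
        "SUNO_USE_SMALL_MODELS": str(small_models),  # "True" or "False"
        "SUNO_OFFLOAD_CPU": str(offload_cpu),
    }
    os.environ.update(settings)
    return settings


configure_bark_env(small_models=True, offload_cpu=True)
# Import Bark only after the variables are in place, for example:
# from bark import generate_audio, preload_models
```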

Hardware Requirements¶

Metric	Small models	Large models
GPU VRAM	~4 GB	~12 GB
Generation speed	Fast	Medium
Audio quality	Good	Best

Quick Start¶

Basic Speech Generation¶

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Preload the models
preload_models()

# Generate speech
audio_array = generate_audio("Hello, this is a test of Bark text to speech.")

# Save as WAV
write_wav("output.wav", SAMPLE_RATE, audio_array)
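
`generate_audio` returns floating-point samples, and a floating-point WAV file is not accepted by every player. As a hedged alternative, the sketch below writes 16-bit PCM using only the standard library. It assumes mono samples in the range [-1.0, 1.0]; `write_pcm16` is an illustrative helper, not part of Bark or SciPy.

```python
import array
import sys
import wave


def write_pcm16(path, samples, sample_rate=24_000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip each sample, then scale it to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm)


# Usage with Bark (not executed here):
# write_pcm16("output_pcm16.wav", audio_array, SAMPLE_RATE)
```

The per-sample loop is slow compared with a vectorized conversion, but it keeps the example free of extra dependencies.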

Chinese Speech¶

# Chinese voice preset
audio = generate_audio(
    "你好，我是一个人工智能语音助手。今天天气很好。",
    history_prompt="v2/zh_speaker_1"
)
write_wav("chinese_output.wav", SAMPLE_RATE, audio)

Using Different Voice Presets¶

# English preset
audio = generate_audio("Hello world!", history_prompt="v2/en_speaker_3")

# Chinese preset
audio = generate_audio("你好世界！", history_prompt="v2/zh_speaker_5")

# Japanese preset
audio = generate_audio("こんにちは世界！", history_prompt="v2/ja_speaker_1")

# Available voice presets:
# v2/en_speaker_0 ~ v2/en_speaker_9
# v2/zh_speaker_0 ~ v2/zh_speaker_9
# v2/ja_speaker_0 ~ v2/ja_speaker_9
# v2/de_speaker_0 ~ v2/de_speaker_9
# v2/fr_speaker_0 ~ v2/fr_speaker_9
# and so on...
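
Because preset names follow a fixed pattern, a small helper can build them and catch typos before a slow generation call. This is a sketch under assumptions: the language codes are the ones listed in the Bark README at the time of writing, and `preset_name` is a hypothetical helper, not a Bark function.

```python
# Language codes assumed from the Bark README; the list may change.
SUPPORTED_LANGS = (
    "en", "de", "es", "fr", "hi", "it", "ja", "ko", "pl", "pt", "ru", "tr", "zh",
)


def preset_name(lang: str, speaker: int) -> str:
    """Build a v2 preset string such as 'v2/zh_speaker_5'."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    if not 0 <= speaker <= 9:
        raise ValueError("speaker index must be between 0 and 9")
    return f"v2/{lang}_speaker_{speaker}"


print(preset_name("zh", 5))  # v2/zh_speaker_5
```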

Non-Speech Sounds¶

# Special markers supported by Bark
texts = [
    "有时候... [laughs] 事情就是这么搞笑。",
    "[clears throat] 好了，我们开始正题。",
    "这是 [sighs] 很难的决定。",
    "♪ Do re mi fa sol la ti do ♪",  # singing
    "— 等一下 —",                    # hesitation
]

for i, text in enumerate(texts):
    audio = generate_audio(text)
    write_wav(f"special_{i}.wav", SAMPLE_RATE, audio)
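
Bark gives no error for a misspelled tag; the tag is treated as ordinary text. The sketch below flags bracketed tags that are not on a known list. The list is an assumption based on the tags mentioned in the Bark README and is not exhaustive; `unknown_tags` is an illustrative helper.

```python
import re

# Tags assumed from the Bark README; extend the set as needed.
KNOWN_TAGS = {
    "laughter", "laughs", "sighs", "music", "gasps", "clears throat",
    "man", "woman",
}

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unknown_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in KNOWN_TAGS."""
    found = TAG_PATTERN.findall(text)
    return [tag for tag in found if tag.strip().lower() not in KNOWN_TAGS]


print(unknown_tags("[cough] 好了，我们开始正题。 [sighs]"))  # ['cough']
```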

Advanced Usage¶

Long Text Handling¶

import numpy as np
from bark import generate_audio, SAMPLE_RATE
from scipy.io.wavfile import write as write_wav

def generate_long_audio(text, voice="v2/zh_speaker_3"):
    """Generate a long text piece by piece, then concatenate the audio."""
    # Split into sentences at Chinese sentence-ending punctuation
    sentences = text.replace("。", "。\n").replace("！", "！\n").replace("？", "？\n").split("\n")
    sentences = [s.strip() for s in sentences if s.strip()]

    audio_segments = []
    # 0.3 s pause between pieces; float32 keeps the result in Bark's output dtype
    silence = np.zeros(int(0.3 * SAMPLE_RATE), dtype=np.float32)

    for sentence in sentences:
        if len(sentence) > 200:  # split overly long sentences again
            parts = [sentence[i:i+200] for i in range(0, len(sentence), 200)]
        else:
            parts = [sentence]

        for part in parts:
            audio = generate_audio(part, history_prompt=voice)
            audio_segments.append(audio)
            audio_segments.append(silence)

    # Concatenate all segments
    full_audio = np.concatenate(audio_segments)
    return full_audio

text = """
人工智能正在改变我们的生活。从自动驾驶到医疗诊断，AI的应用无处不在。
然而，我们也需要关注AI带来的伦理问题。隐私保护和算法公平性是重要的课题。
"""

audio = generate_long_audio(text)
write_wav("long_speech.wav", SAMPLE_RATE, audio)
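
Generating one short sentence per call can make the result sound choppy, so a possible refinement is to merge neighbouring sentences into chunks that stay under a length limit. The sketch below shows only the text-splitting step and assumes Chinese punctuation, where chunks can be joined without spaces. `pack_sentences` and the 200-character default are illustrative choices.

```python
import re

# Split right after Chinese sentence-ending punctuation (zero-width split).
_SENTENCE_END = re.compile(r"(?<=[。！？])")


def pack_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentences, then greedily merge them into chunks."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks


print(pack_sentences("一二三。四五六！七八九？", max_chars=8))
```

Each returned chunk can then be passed to `generate_audio` in place of the per-sentence loop.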

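Voice Cloning (Custom Voices)¶

from bark import generate_audio
from bark.api import save_as_prompt

# Note: Bark's voice-cloning ability is limited.
# A custom voice is expressed as a history prompt. The approach here reuses
# the voice of an earlier generation rather than an arbitrary recording.

# Step 1: generate a reference clip and keep the full token output
full_generation, reference_audio = generate_audio(
    "This is my reference voice sample.",
    history_prompt="v2/en_speaker_6",
    output_full=True,
)

# Step 2: save the tokens as a custom prompt (the path must end in .npz)
save_as_prompt("my_voice.npz", full_generation)

# Step 3: reuse the saved prompt in later calls
audio = generate_audio("A new sentence in the same voice.", history_prompt="my_voice.npz")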

Streaming Generation¶

from bark import SAMPLE_RATE, generate_audio, preload_models
import sounddevice as sd  # third-party package: pip install sounddevice

preload_models()

def stream_speech(text, voice="v2/zh_speaker_3"):
    """Generate and play the text one sentence at a time."""
    sentences = [s.strip() for s in text.split("。") if s.strip()]

    for sentence in sentences:
        sentence = sentence + "。"
        audio = generate_audio(sentence, history_prompt=voice)
        sd.play(audio, SAMPLE_RATE)
        sd.wait()  # block until this sentence has finished playing

stream_speech("今天我们来讨论人工智能。机器学习是其中一个重要分支。深度学习推动了近年来的技术突破。")
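
The loop above leaves a silent gap while each sentence is generated. One way to shorten it is to synthesize the next sentence in a worker thread while the current one plays. The sketch below shows that producer/consumer pattern with the synthesizer and player passed in as callables, so it makes no assumption about Bark's API. `pipelined_playback` is a hypothetical helper.

```python
import queue
import threading


def pipelined_playback(sentences, synthesize, play):
    """Synthesize sentence N+1 in a worker thread while sentence N plays."""
    buffer = queue.Queue(maxsize=2)  # small buffer bounds memory use
    done = object()                  # sentinel marking the end of input

    def producer():
        try:
            for sentence in sentences:
                buffer.put(synthesize(sentence))
        finally:
            buffer.put(done)  # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        audio = buffer.get()
        if audio is done:
            break
        play(audio)  # blocks while the worker prepares the next clip
    worker.join()


# Stand-in callables for demonstration; no audio is produced here.
played = []
pipelined_playback(["a", "b", "c"], lambda s: s.upper(), played.append)
print(played)
```

With Bark, `synthesize` could wrap `generate_audio(sentence, history_prompt=voice)` and `play` could call `sd.play(audio, SAMPLE_RATE)` followed by `sd.wait()`. An exception raised inside `synthesize` ends playback early in this sketch, so production code would need to report it to the caller.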

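Deploying as an API¶

import os
import tempfile

from bark import SAMPLE_RATE, generate_audio, preload_models
from fastapi import FastAPI
from fastapi.responses import FileResponse
from scipy.io.wavfile import write as write_wav
from starlette.background import BackgroundTask

app = FastAPI()
preload_models()

@app.post("/tts")
def text_to_speech(text: str, voice: str = "v2/zh_speaker_3"):
    # A plain `def` endpoint runs in FastAPI's thread pool, so the slow
    # generate_audio call does not block the event loop.
    audio = generate_audio(text, history_prompt=voice)

    # Close the temp file handle before writing to its path
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    tmp.close()
    write_wav(tmp.name, SAMPLE_RATE, audio)

    return FileResponse(
        tmp.name,
        media_type="audio/wav",
        filename="speech.wav",
        background=BackgroundTask(os.remove, tmp.name),  # delete after sending
    )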
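
Since generation is slow, a service that often receives the same request may benefit from caching results on disk. The sketch below covers only the key-derivation step: it maps a (text, voice) pair to a stable file path. `cache_path` and the `tts_cache` directory are hypothetical names, and whether caching is appropriate depends on whether callers expect a fresh random take each time.

```python
import hashlib
from pathlib import Path


def cache_path(text: str, voice: str, cache_dir: str = "tts_cache") -> Path:
    """Map a (text, voice) request to a stable WAV file path."""
    # The newline separator keeps ("ab", "c") and ("a", "bc") distinct.
    key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{key}.wav"


print(cache_path("你好世界！", "v2/zh_speaker_3").name)
```

An endpoint could check `cache_path(text, voice).exists()` before calling `generate_audio`, and should then return the cached file without scheduling its deletion.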

Batch Generation¶

from pathlib import Path

from bark import generate_audio, SAMPLE_RATE
from scipy.io.wavfile import write as write_wav

# Batch configuration
tasks = [
    {"text": "第一章：引言", "voice": "v2/zh_speaker_1", "output": "ch1_intro.wav"},
    {"text": "第二章：方法", "voice": "v2/zh_speaker_1", "output": "ch2_method.wav"},
    {"text": "第三章：结论", "voice": "v2/zh_speaker_1", "output": "ch3_conclusion.wav"},
]

output_dir = Path("./audiobook")
output_dir.mkdir(exist_ok=True)

for task in tasks:
    print(f"Generating: {task['output']}")
    audio = generate_audio(task["text"], history_prompt=task["voice"])
    write_wav(str(output_dir / task["output"]), SAMPLE_RATE, audio)
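
For a long audiobook it can help to keep the task list in a JSON file and to skip files that already exist, so an interrupted run can resume. The sketch below assumes a manifest that is a JSON list of objects with the same `text`, `voice`, and `output` keys used above. `pending_tasks` is an illustrative helper.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"text", "voice", "output"}


def pending_tasks(manifest_path, output_dir):
    """Load tasks from a JSON list and drop those whose output already exists."""
    tasks = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    out = Path(output_dir)
    todo = []
    for i, task in enumerate(tasks):
        missing = REQUIRED_KEYS - task.keys()
        if missing:
            raise ValueError(f"task {i} is missing keys: {sorted(missing)}")
        if not (out / task["output"]).exists():
            todo.append(task)
    return todo


# Usage (not executed here):
# for task in pending_tasks("tasks.json", "./audiobook"):
#     ...generate and save as in the loop above...
```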

FAQ¶

Q: Generation is slow?¶

Set SUNO_USE_SMALL_MODELS=True to use the small models
Run on a GPU
Keep each generation call to one or two sentences
Consider bark.cpp, an accelerated C++ port
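
To judge whether any of these changes helps, it is useful to measure the real-time factor: generation time divided by the duration of the audio produced. The sketch below times a single call to any synthesizer passed in as a callable. `real_time_factor` is a hypothetical helper, and the 24 kHz default is an assumption matching Bark's `SAMPLE_RATE`.

```python
import time


def real_time_factor(synthesize, text, sample_rate=24_000):
    """Return (generation_seconds / audio_seconds, audio) for one call."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    if len(audio) == 0:
        raise ValueError("synthesize returned no samples")
    return elapsed / (len(audio) / sample_rate), audio


# Stand-in synthesizer that returns one second of silence.
rtf, audio = real_time_factor(lambda t: [0.0] * 24_000, "test")
print(f"real-time factor: {rtf:.6f}")
```

A value below 1.0 means audio is produced faster than it plays back. With Bark, pass something like `lambda t: generate_audio(t, history_prompt=voice)` as `synthesize`.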

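Q: The generated speech is noisy?¶

Generate several takes; Bark's output is random
Try a different voice preset
Keep each input under 200 characters
Add punctuation to help with phrasing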

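Q: How does it differ from other TTS tools?¶

Bark: very natural, supports non-speech sounds, but hard to control (output is random)
Coqui TTS: more controllable, supports fine-tuning, more flexible to deploy
Edge TTS: free online service with low latency, but requires a network connection
Piper: the lightest option, offline and real-time, but less expressive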

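Q: How do I get the same output every time?¶

# Set the random seed before each call
import torch
from bark import generate_audio

torch.manual_seed(42)  # results can still differ across GPUs and library versions

audio = generate_audio("固定语音", history_prompt="v2/zh_speaker_3")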

References¶

GitHub: https://github.com/suno-ai/bark
Hugging Face: https://huggingface.co/suno/bark
bark.cpp: https://github.com/PABannier/bark.cpp
Suno AI: https://www.suno.ai/
Voice preset list: https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683